
ESM2A and ESM2B

Course Notes

M. Stadlbauer

Spring term 2008

Jacobs University Bremen

June 4, 2008

Contents

1 Introduction
    1.1 ESM2A - t-test by example
    1.2 ESM2B - Fourier analysis in JPG-compression

2 Linear algebra
    2.1 Basis and dimension
        2.1.1 Linear independence
        2.1.2 Basis and dimension
        2.1.3 Inner products
    2.2 Linear operators and matrix algebra
        2.2.1 Matrices and linear operators
        2.2.2 Matrix algebra
        2.2.3 Solving systems of linear equations, and inverting matrices
        2.2.4 Determinants
    2.3 Endomorphisms
        2.3.1 Change of base
        2.3.2 Eigenvectors and eigenvalues
        2.3.3 Spectral theorem for symmetric matrices
        2.3.4 Linear groups (ESM2B)
        2.3.5 Spectral theorem for normal matrices (ESM2B)
        2.3.6 The Jordan normal form (ESM2B)

3 Probability theory
    3.1 Basic notions of set theory
    3.2 Discrete probability spaces
        3.2.1 Combinatorics, and uniform sample spaces
        3.2.2 Conditional probabilities
        3.2.3 Important discrete probability spaces
    3.3 Continuous probability spaces
        3.3.1 Measurability
        3.3.2 Densities and distribution functions
        3.3.3 Probability measures on R and generalized functions (ESM 2B)
        3.3.4 Important continuous probability spaces
    3.4 Random variables
        3.4.1 Sums of independent random variables
    3.5 Expectation and variance
    3.6 Limit theorems for sums of independent random variables
        3.6.1 The law of large numbers
        3.6.2 The central limit theorem

4 Statistics (ESM 2A)
    4.1 Estimators
        4.1.1 Estimators for expectation and variance
        4.1.2 Estimators for quantiles
    4.2 Statistics and their distributions
    4.3 Confidence intervals
    4.4 Hypothesis tests
        4.4.1 Hypothesis tests for univariate data
        4.4.2 Hypothesis tests for bivariate data
        4.4.3 χ2-tests
        4.4.4 Single factor analysis of variance
    4.5 Linear regression

5 Fourier analysis (ESM 2B)
    5.1 Banach and Hilbert spaces
        5.1.1 Orthonormal bases for separable Hilbert spaces
    5.2 Fourier series in L2([0,L])
        5.2.1 The Fourier basis
        5.2.2 Fourier coefficients
    5.3 Fourier series for periodic functions
        5.3.1 Function classes with stronger convergence properties
        5.3.2 Fourier coefficients of even and odd functions
        5.3.3 Expanding functions to continuous periodic functions
        5.3.4 An application of Fourier series to partial differential equations
    5.4 The Fourier transform
        5.4.1 The Dirichlet and Fejer Kernel
        5.4.2 Convolutions
        5.4.3 The Fourier transform
        5.4.4 The inversion formula and Plancherel’s theorem
        5.4.5 An application of the Fourier transform to partial differential equations
    5.5 The δ-function

Chapter 1

Introduction

These notes cover the topics of ESM2A and ESM2B. Since some parts of the courses are similar, a single text is provided. Parts which are relevant for only one of the courses are marked accordingly. Please submit corrections and suggestions for further improvement to the instructor.

1.1 ESM2A - t-test by example

Before starting with our course, we first give a motivating example for the use of statistical inference (a topic which is often disliked by students). Assume we have observed the following data on the starting salaries of students who obtained a degree from University A or University B.

University A    University B
46487           44291
34234           45067
21705           49409
49239           37735
42709           49375
53793           47948
55846           44626
45036           43044
43287           47861
57068           56215

A number which may measure the difference between the average starting salaries is the difference between the mean values of the two groups, that is the difference of 44940.40 (mean value for University A) and 46557.10 (mean value for University B).

The following question arises: Do students from University A earn more than students from University B? The aim of statistics (or statistical inference) is now to give an answer to this question for a given error probability (= probability of a wrong decision). To be more precise, there are two possibilities.

(i) We can give an answer which is significant at level α (that is, the error probability is smaller than α; e.g. α = 0.1 means that the error probability is smaller than 10%).

(ii) We can give no significant answer. In this case, we may conclude that there is no difference (but this decision is not significant). Perhaps we could reveal a difference if we had access to more observations.

A statistician in this case usually employs the so-called t-test using a computer (here, it is done using the open-source software ‘R’), and obtains the following output:

        Welch Two Sample t-test

    data:  z[, 1] and z[, 2]
    t = -0.4352, df = 12.597, p-value = 0.6707
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -9667.403  6434.003
    sample estimates:
    mean of x mean of y
      44940.4   46557.1

Since 0.1 is smaller than 0.6707 (from the output: p-value = 0.6707), the statistician concludes that there is no significant difference at level 0.1.
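The test statistic and the degrees of freedom in the output above can be recomputed by hand. The following is a minimal sketch in Python (not the R call used for the notes), assuming the usual formulas of Welch's t-test; the names `univ_a`, `univ_b` and `welch_t` are chosen here for illustration only. The p-value is not recomputed, since it requires the distribution function of the t-distribution.

```python
from math import sqrt
from statistics import mean, variance

# Starting salaries from the table above.
univ_a = [46487, 34234, 21705, 49239, 42709, 53793, 55846, 45036, 43287, 57068]
univ_b = [44291, 45067, 49409, 37735, 49375, 47948, 44626, 43044, 47861, 56215]


def welch_t(x, y):
    """Return (t, df) for Welch's two-sample t-test (illustrative helper)."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two sample means (sample variance / n).
    sx, sy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(sx + sy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df


t, df = welch_t(univ_a, univ_b)
print(f"mean A = {mean(univ_a):.1f}, mean B = {mean(univ_b):.1f}")
print(f"t = {t:.4f}, df = {df:.3f}")
```

Up to rounding, the printed values agree with the lines `t = -0.4352, df = 12.597` and the sample estimates of the R output.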

1.2 ESM2B - Fourier analysis in JPG-compression

The expansion of a function on a bounded interval into its Fourier series is, informally, the decomposition of the function into its frequency components. The idea behind compression to jpg, mp3 or mpeg is the following.

(i) Cut the data into small pieces.
jpg-example: Consider each color channel R, G and B separately, and divide the picture into pieces of 8×8 pixels.

(ii) Do a Fourier expansion for each of these small pieces.
jpg-example: Fourier expansion for functions of type

\[ \{(i,j) : 1 \le i,j \le 8\} \to \mathbb{R}, \qquad (i,j) \mapsto F(i,j), \]

where $F(i,j)$ denotes the value of the color channel R at pixel $(i,j)$, or of the channels G, B respectively.

(iii) Compress the data according to the significance of the corresponding frequency.
jpg-example: Encode a frequency with 1 to 32 bit, where the number of bits is chosen according to significance. The range of $F$ is finite (usually the color channels are encoded by 32 bit, hence the range has at most $2^{32}$ elements).

(iv) Do a smoothing procedure at the boundaries of the pieces obtained in (i).
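Steps (ii) and (iii) can be illustrated in one dimension. The sketch below is not the jpg algorithm itself: it assumes a single row of 8 hypothetical pixel values, uses the discrete cosine transform (a close relative of the Fourier expansion, which is what jpg uses in practice), and "compresses" simply by keeping the 4 largest coefficients. All names are chosen for illustration.

```python
import math

N = 8  # block length, one row of an 8x8 piece


def scale(k):
    """Normalisation which makes the cosine basis orthonormal."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def dct(signal):
    """Coefficients of `signal` with respect to the cosine basis (DCT-II)."""
    return [scale(k) * sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                           for n, x in enumerate(signal))
            for k in range(N)]


def idct(coeffs):
    """Rebuild the signal from its coefficients (inverse transform)."""
    return [sum(scale(k) * c * math.cos(math.pi * (n + 0.5) * k / N)
                for k, c in enumerate(coeffs))
            for n in range(N)]


row = [52, 55, 61, 66, 70, 61, 64, 73]   # hypothetical pixel values
coeffs = dct(row)

# "Compression": keep only the 4 largest coefficients, set the rest to 0.
keep = sorted(range(N), key=lambda k: abs(coeffs[k]), reverse=True)[:4]
compressed = [c if k in keep else 0.0 for k, c in enumerate(coeffs)]

exact = idct(coeffs)        # reproduces `row` up to rounding errors
approx = idct(compressed)   # approximation built from 4 frequencies only
print([round(x, 1) for x in approx])
```

Since the cosine basis used here is orthonormal, the sum of the squared coefficients equals the sum of the squared pixel values, so dropping small coefficients changes the row only slightly.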


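Notations

We now fix the following sets and notations.

\[
\begin{aligned}
\mathbb{N} &:= \{1, 2, \ldots\}, \qquad \mathbb{N}_0 := \{0, 1, 2, \ldots\}, \qquad \mathbb{Z} := \{0, 1, -1, 2, -2, \ldots\},\\
\mathbb{Q} &:= \Big\{ \frac{p}{q} : p \in \mathbb{Z},\ q \in \mathbb{N} \Big\},\\
\mathbb{R} &:= \Big\{ \lim_{n\to\infty} x_n : (x_n) \text{ a sequence in } \mathbb{Q} \text{ s.t. } \lim_{n\to\infty} x_n \text{ exists} \Big\},\\
\mathbb{C} &:= \{ x + iy : x, y \in \mathbb{R} \}.
\end{aligned}
\]

Now let $X$ be a set, and $n \in \mathbb{N}$. Then

\[
X^n := \left\{ \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} : x_1, \ldots, x_n \in X \right\}
\]

is called the set of column vectors, and

\[
M_{n\times m}(X) := \left\{ \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m}\\ x_{21} & x_{22} & \cdots & x_{2m}\\ \vdots & & & \vdots\\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix} : x_{11}, \ldots, x_{nm} \in X \right\}
\]

is the set of $n \times m$ matrices. Important examples are $\mathbb{R}^n$ and $M_{n\times m}(\mathbb{R})$. A function $f$ which maps elements from the set $X$ to the set $Y$ (that is, for each $x \in X$ there is a unique element $f(x) \in Y$) will be denoted by

\[ f : X \to Y, \quad x \mapsto f(x). \]

Note that this notation does not imply that $f(X) = Y$. Well-known examples from high school are

\[
\begin{aligned}
f &: \mathbb{R} \to \mathbb{R}, & x &\mapsto x^3 + x + 1,\\
g &: \mathbb{R} \to \mathbb{R}, & x &\mapsto e^x,\\
h &: (0,\infty) \to \mathbb{R}, & x &\mapsto \ln(x).
\end{aligned}
\]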

Chapter 2

Linear algebra

A mathematician will tell you that linear algebra is the theory of vector spaces and linear operators (which are maps from one vector space to another that preserve the linear structure). Vector spaces are used to obtain mathematical models for the following objects.

(i) 3-dimensional space, or in general higher dimensional spaces, by $\mathbb{R}^n$.

(ii) Signals in continuous time, by the space of all functions $\mathbb{R} \to \mathbb{R}$ (in fact, suitable subspaces of this space are considered, because the space of all functions is too big to handle).

(iii) Independent random variables.

Moreover, the theory of linear operators applies to linear equations, Fourier analysis and probability theory. We now give the abstract definition of a vector space.

So let $\mathbb{F}$ be either $\mathbb{C}$ or $\mathbb{R}$. Informally, an $\mathbb{F}$-vector space is a set in which you can add and subtract elements, and moreover multiply an element by some $\lambda \in \mathbb{F}$. The precise definition is given below, where we consider $+$ as an abstract map $V \times V \to V$, $(u,v) \mapsto u + v$, and $\cdot$ as a map $\mathbb{F} \times V \to V$, $(\lambda, v) \mapsto \lambda \cdot v$.

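Definition 2.0.1 (Vector space). Let $V$ be a set such that

(i) for all $u, v, w \in V$,

\[ u + v = v + u \quad \text{(Abelian)}, \qquad (u + v) + w = u + (v + w) \quad \text{(associative)}, \]

(ii) there exists $0 \in V$ such that $0 + v = v$ for all $v \in V$,

(iii) and for all $\lambda, \mu \in \mathbb{F}$, $u, v \in V$,

\[ \lambda \cdot (u + v) = \lambda \cdot u + \lambda \cdot v, \qquad (\lambda + \mu) \cdot v = \lambda \cdot v + \mu \cdot v, \qquad 0 \cdot v = 0. \]

Then $V$ is called a vector space over $\mathbb{F}$, or $\mathbb{F}$-vector space.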

The operations “+” and “·” are sometimes called vector addition and scalar multiplication, respectively. Note that conditions (i)-(iii) ensure that one can add and subtract elements of $V$ as usual. Moreover, one can multiply from the left (but only from the left) with elements from $\mathbb{F}$ in the usual manner. We now give several examples of vector spaces.

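Example 1. The basic model for n-dimensional (Euclidean) space is given by $V = \mathbb{R}^n$, where

\[
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} + \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} := \begin{pmatrix} x_1 + y_1 \\ \vdots \\ x_n + y_n \end{pmatrix}, \qquad
\lambda \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} := \begin{pmatrix} \lambda x_1 \\ \vdots \\ \lambda x_n \end{pmatrix},
\]

for

\[
\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}, \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \quad \lambda \in \mathbb{R}.
\]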

Example 2. Let $C([0,1])$ be the space of continuous functions, that is $f : [0,1] \to \mathbb{R}$ is an element of $C([0,1])$ if $f$ is continuous. Set

\[ f + g : [0,1] \to \mathbb{R}, \quad x \mapsto f(x) + g(x), \qquad \text{for } f, g \in C([0,1]), \]

and

\[ \lambda \cdot f : [0,1] \to \mathbb{R}, \quad x \mapsto \lambda f(x), \qquad \text{for } f \in C([0,1]),\ \lambda \in \mathbb{R}, \]

and let $0 \in C([0,1])$ be the function which is equal to 0 for all $x \in [0,1]$. Then $C([0,1])$ is a vector space, since $f + g$ and $\lambda f$ are continuous functions.

Example 3 (ESM2B). Let $C^n([0,1])$ be the space of n-times continuously differentiable functions, that is $f : [0,1] \to \mathbb{R}$ is an element of $C^n([0,1])$ if for all $k = 1, \ldots, n$, the k-th derivative $f^{(k)}$ exists and is continuous. Then with the same definitions as in the previous example, $C^n([0,1])$ is a vector space. This can be seen as follows. For $k = 1, \ldots, n$, the k-th derivatives of $f + g$ and $\lambda f$ are given by $(f + g)^{(k)} = f^{(k)} + g^{(k)}$ and $(\lambda f)^{(k)} = \lambda \cdot f^{(k)}$, respectively. In particular, $(f + g)^{(k)}$ and $(\lambda f)^{(k)}$ are continuous, and hence $f + g$ and $\lambda f$ are in $C^n([0,1])$.

2.1 Basis and dimension

In the sequel, let $\mathbb{F}$ be either $\mathbb{R}$ or $\mathbb{C}$. By definition of a vector space over $\mathbb{F}$, each finite linear combination of elements of a vector space $V$ is again an element of the vector space. That is, for all $\lambda_1, \ldots, \lambda_n \in \mathbb{F}$ and $v_1, \ldots, v_n \in V$,

\[ \sum_{i=1}^n \lambda_i v_i = \lambda_1 v_1 + \cdots + \lambda_n v_n \in V. \tag{2.1} \]

An expression of type (2.1) is called a (finite) linear combination of $v_1, \ldots, v_n$. The question now arises whether each element of $V$ can be represented by linear combinations of a fixed set $\{v_1, \ldots, v_n\}$, and whether there are minimal sets of this kind. This leads to the following considerations and objects.

Definition 2.1.1. Let $V$ be a vector space over $\mathbb{F}$, and let $v_1, \ldots, v_n$ be elements of $V$. Then the span of $v_1, \ldots, v_n$ is defined by

\[ \operatorname{span}(v_1, \ldots, v_n) := \Big\{ \sum_{i=1}^n \lambda_i v_i : \lambda_1, \ldots, \lambda_n \in \mathbb{F} \Big\}. \]

Remark (ESM2B). This definition can be generalized to the span of arbitrary subsets $A$ of $V$ by defining the span as the set of all finite linear combinations of elements of $A$:

\[ \operatorname{span}(A) := \Big\{ \sum_{i=1}^n \lambda_i v_i : n \in \mathbb{N},\ v_1, \ldots, v_n \in A,\ \lambda_1, \ldots, \lambda_n \in \mathbb{F} \Big\}. \]

Note that the above definition can be recovered by setting $A := \{v_1, \ldots, v_n\}$.

By definition, $\operatorname{span}(v_1, \ldots, v_n) \subset V$. Moreover, it will turn out that the span is also a vector space. To cover this in more generality, we define the following.

Definition 2.1.2. Let $V$ be a vector space (over $\mathbb{F}$). Then a subset $U \subset V$ is called a subspace of $V$ if for all $v, w \in U$, $\lambda \in \mathbb{F}$,

\[ v + w \in U, \quad \text{and} \quad \lambda v \in U. \]

There are several immediate implications of this definition for a subspace $U$ of $V$.

(i) Since $0 \in \mathbb{F}$, and $0v = 0 \in U$ for all $v \in U$, it follows that $0 \in U$.

(ii) $U$ itself is a vector space: $U$ is closed under vector addition and scalar multiplication. Hence all the properties of a vector space are inherited from the corresponding properties of $V$ (check this as an exercise).

(iii) Let $v_1, \ldots, v_n \in V$. Then $\operatorname{span}(v_1, \ldots, v_n)$ is a subspace of $V$.

2.1.1 Linear independence

As mentioned above, it would be of interest to know whether, for some subset $B$ of $A$, we have $\operatorname{span}(B) = \operatorname{span}(A)$. That is, we wish to extract the relevant elements of $A$. We will show that this is related to the notion of linear independence.

Definition 2.1.3. Let $V$ be a vector space, and $v_1, \ldots, v_n \in V$. Then the set $\{v_1, \ldots, v_n\}$ is called linearly independent (or abbreviated: $v_1, \ldots, v_n$ are called linearly independent), if

\[ \sum_{i=1}^n \lambda_i v_i = \lambda_1 v_1 + \cdots + \lambda_n v_n = 0 \qquad (\lambda_1, \ldots, \lambda_n \in \mathbb{F}) \]

implies that $\lambda_1 = \lambda_2 = \cdots = \lambda_n = 0$.

The notion of linear independence aims at the unique representation of the elements of a vector space in terms of column vectors with entries in $\mathbb{F}$. More precisely, the notions of independence and span are related as follows.

Proposition 2.1.4. Let $v_1, \ldots, v_n \in V$. The following are equivalent.

(i) $v_1, \ldots, v_n$ are linearly independent.

(ii) For each $w \in \operatorname{span}(v_1, \ldots, v_n)$, there exist unique elements $\lambda_1, \ldots, \lambda_n \in \mathbb{F}$ such that

\[ \sum_{i=1}^n \lambda_i v_i = w. \]

The proof is easy, and is included since it provides insight into the abstract concept of linear independence.

Proof. Step 1: (i) implies (ii). Assume that there exist $\lambda_1, \ldots, \lambda_n \in \mathbb{F}$ and $\mu_1, \ldots, \mu_n \in \mathbb{F}$ such that

\[ w = \sum_{i=1}^n \lambda_i v_i = \sum_{i=1}^n \mu_i v_i. \]

Then

\[ \sum_{i=1}^n \lambda_i v_i - \sum_{i=1}^n \mu_i v_i = \sum_{i=1}^n (\lambda_i - \mu_i) v_i = w - w = 0. \]

Hence by linear independence, $\lambda_i - \mu_i = 0$ for $i = 1, \ldots, n$. In particular, $\lambda_i = \mu_i$ for $i = 1, \ldots, n$.

Step 2: (ii) implies (i). For $0 \in V$, there exist uniquely determined $\lambda_1, \ldots, \lambda_n \in \mathbb{F}$ with

\[ 0 = \sum_{i=1}^n \lambda_i v_i. \]

Since $\sum_{i=1}^n 0 v_i = 0$, we have $\lambda_1 = \cdots = \lambda_n = 0$.

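We first give several examples for the spaces $\mathbb{F}^n$. For these, it is often useful to compare the statements with geometrical intuition.

Example 4. Linearly dependent vectors in $\mathbb{R}^2$. Let

\[ v_1 := \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad v_2 := \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad v_3 := \begin{pmatrix} 2 \\ 1/3 \end{pmatrix}. \]

These vectors are dependent, since

\[ 2 v_1 + \tfrac{1}{3} v_2 = v_3 \iff 2 v_1 + \tfrac{1}{3} v_2 - v_3 = 0. \]

Example 5. Linearly independent vectors in $\mathbb{R}^2$. Let

\[ v_1 := \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad v_2 := \begin{pmatrix} 2 \\ 1 \end{pmatrix}. \]

These vectors are independent, since

\[ \lambda_1 v_1 + \lambda_2 v_2 = 0 \;\Rightarrow\; \lambda_2 \cdot 1 = 0 \;\Rightarrow\; \lambda_2 = 0 \;\Rightarrow\; \lambda_1 \cdot 1 + 0 \cdot 2 = 0 \;\Rightarrow\; \lambda_1 = 0. \]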
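For vectors in $\mathbb{R}^n$, linear independence can also be checked mechanically. The following sketch assumes the row reduction known from solving linear systems: the vectors are written as rows, and they are independent if and only if no row is eliminated completely. The function names `rank` and `independent` are chosen here for illustration; exact fractions avoid rounding problems.

```python
from fractions import Fraction


def rank(vectors):
    """Dimension of the span of the given vectors, found by row reduction."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    for col in range(len(rows[0])):
        # Find a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate the entries of this column in all rows below the pivot row.
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def independent(vectors):
    """True if no vector is a linear combination of the others."""
    return rank(vectors) == len(vectors)


example4 = [(1, 0), (0, 1), (2, Fraction(1, 3))]   # v1, v2, v3 of Example 4
example5 = [(1, 0), (2, 1)]                        # v1, v2 of Example 5

print(independent(example4))   # False
print(independent(example5))   # True
```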


Example 6. For $k = 1, \ldots, n$, let

\[ e_k := \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} \leftarrow k\text{-th entry}. \]

The set $\{e_1, \ldots, e_n\}$ is linearly independent, and $\operatorname{span}(e_1, \ldots, e_n) = \mathbb{F}^n$. This set is called the standard basis of $\mathbb{F}^n$ (for the definition of a basis, see below). [1]

Example 7. Let $P_m$ be the set of polynomials of degree smaller than or equal to $m$. Then the set $\{1, x, x^2, \ldots, x^m\}$ is linearly independent.

Example 8 (ESM2B). $f_n : [0,1] \to \mathbb{C}, \quad x \mapsto e^{2\pi i n x}$

2.1.2 Basis and dimension

In Example 6, we have seen that the standard basis is linearly independent, and that its span is equal to $\mathbb{F}^n$. These are the main features of a basis. For general vector spaces, we generalize as follows.

Definition 2.1.5. A finite subset $\{v_1, \ldots, v_n\}$ of the vector space $V$ is called a basis of $V$ if

(i) $\operatorname{span}(v_1, \ldots, v_n) = V$,

(ii) the set $\{v_1, \ldots, v_n\}$ is linearly independent.

One might guess that in the above definition, $n$ is the dimension of the vector space. In order to define the dimension in this way, one must ensure that the cardinality of a basis does not depend on the choice of the basis.

Proposition 2.1.6. Two bases of a vector space have the same cardinality.

Proof (sketch). Assume that $\{v_1, \ldots, v_n\}$ and $\{w_1, \ldots, w_m\}$ are bases of $V$. Now replace one element of $\{v_1, \ldots, v_n\}$ by an element of $\{w_1, \ldots, w_m\}$. Then iterate this procedure.

This result now implies that the following definition makes sense (mathematicians say that the dimension is a well-defined object).

Definition 2.1.7. Let $V$ be a vector space which admits a basis. Then $V$ is called a finite dimensional vector space. The cardinality of the basis is called the dimension of $V$, and will be denoted by $\dim(V)$.

Example 9. We have $\dim(\mathbb{F}^n) = n$ by Example 6.

[1] More precisely, if one considers $\{e_1, \ldots, e_n\}$ as a subset of $\mathbb{R}^n$, it is called the standard basis of $\mathbb{R}^n$, and if one considers $\{e_1, \ldots, e_n\}$ as a subset of $\mathbb{C}^n$, it is called the standard basis of $\mathbb{C}^n$.

Example 10. Let $P_m := \{a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m : a_0, \ldots, a_m \in \mathbb{F}\}$ be the set of all polynomials of degree smaller than or equal to $m \in \mathbb{N}$. This space is an $\mathbb{F}$-vector space, and $\{1, x, x^2, \ldots, x^m\}$ is a basis. Hence $\dim(P_m) = m + 1$, and in particular, the vector space of all polynomials is not finite dimensional.

Example 11. $C([0,1])$ is infinite dimensional.

Summarizing the results of this section, we are now able to characterize a basis of a finite dimensional vector space, and obtain a representation of elements of the vector space as linear combinations of a basis.

Theorem 2.1.8. A finite subset $A$ of the finite dimensional vector space $V$ is a basis if and only if $A$ is linearly independent and $\dim(V) = \#A$ (where $\#A$ denotes the number of elements of $A$).

Moreover, if $\{v_1, \ldots, v_n\}$ is a basis of $V$, then for each $v \in V$ there exist uniquely determined $\lambda_1, \ldots, \lambda_n$ such that

\[ v = \lambda_1 v_1 + \lambda_2 v_2 + \cdots + \lambda_n v_n. \]

As an immediate consequence, we obtain that more than $n$ vectors of an n-dimensional vector space have to be linearly dependent (see e.g. Example 4). Moreover, by choosing a basis of an n-dimensional vector space, we are now able to treat elements of the space explicitly as elements of $\mathbb{F}^n$. In particular, all n-dimensional vector spaces are essentially the same. A precise statement of this observation will be derived using linear operators.

2.1.3 Inner products

We now introduce the notion of an inner product. This object will enable us to define (or recover) notions of distances, angles etc. for vector spaces.

Definition 2.1.9 (ESM2A). Let $V$ be a finite dimensional vector space (over $\mathbb{R}$). A map $(\cdot,\cdot) : V \times V \to \mathbb{R}$ is called an inner product, if for all $u, v, w \in V$ and $\lambda \in \mathbb{R}$,

(i) $(v, w) = (w, v)$,

(ii) $(u + v, w) = (u, w) + (v, w)$,

(iii) $(\lambda v, w) = \lambda (v, w)$,

(iv) $(v, v) > 0$ for all $v \neq 0$.

Definition 2.1.10 (ESM2B). Let $V$ be a finite dimensional vector space (over $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$). A map $(\cdot,\cdot) : V \times V \to \mathbb{F}$ is called an inner product, if for all $u, v, w \in V$ and $\lambda \in \mathbb{F}$,

(i) $(v, w) = \overline{(w, v)}$,

(ii) $(u + v, w) = (u, w) + (v, w)$,

(iii) $(\lambda v, w) = \lambda (v, w)$,

(iv) $(v, v) > 0$ for all $v \neq 0$.

Note that, using (i) and (iii), it follows that $(v, \lambda w) = \overline{\lambda} (v, w)$. The following objects arise from the notion of an inner product. For $v, w \in V$, $\|v\| := \sqrt{(v, v)}$ is called the norm (or length) of $v \in V$, and $\|v - w\|$ is called the distance between $v$ and $w$. Moreover, two vectors $v, w \in V$ are called orthogonal, if $(v, w) = 0$. With these notions, it is easy to “prove” the Pythagorean theorem. For $v, w \in V$ orthogonal,

\[
\begin{aligned}
\|v - w\|^2 &= (v - w, v - w) = (v, v - w) - (w, v - w) = (v, v) - (v, w) - (w, v) + (w, w)\\
&= \|v\|^2 + \|w\|^2.
\end{aligned}
\]

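Remark (ESM2B). Moreover, we obtain an easy proof of the Cauchy-Schwarz inequality, that is, for $u, v \in V$ we have

\[ |(u, v)| \le \|u\| \|v\|. \]

This can be seen by the following argument. Assume that $u, v \neq 0$, and note that

\[ (u, v - \lambda u) = 0 \iff (u, v) = (u, \lambda u) \iff \overline{\lambda} = \frac{(u, v)}{\|u\|^2}. \]

For this choice of $\lambda$, the vectors $\lambda u$ and $v - \lambda u$ are orthogonal. Using the Pythagorean theorem, it follows that

\[
\|v\|^2 = \|\lambda u\|^2 + \|v - \lambda u\|^2 \ge \|\lambda u\|^2 = (\lambda u, \lambda u) = \lambda \overline{\lambda} \|u\|^2 = |\lambda|^2 \|u\|^2 = \frac{|(u, v)|^2}{\|u\|^2}.
\]

The assertion follows from this.

We now give the standard example of an inner product for $\mathbb{F}^n$. Let $v = \sum_{i=1}^n \lambda_i e_i$ and $w = \sum_{i=1}^n \mu_i e_i$ be elements of $\mathbb{F}^n$, where $\{e_1, \ldots, e_n\}$ refers to the standard basis. Then the standard inner product is defined by

\[ (v, w) := \sum_{i=1}^n \lambda_i \overline{\mu_i}. \]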

We remark that in case $n = 3$ the distance introduced above extends the well-known definition of a distance in $\mathbb{R}^3$. Since

\[ \|v\|^2 = \lambda_1^2 + \lambda_2^2 + \lambda_3^2, \]

it follows that $\|v\|$ is the length of the diagonal of the cuboid with sides of length $\lambda_1, \lambda_2, \lambda_3$. Hence we may use the cosine rule to relate the angle between two vectors with the scalar product. The cosine rule tells us that for a triangle with sides $a, b, c$ and angle $\alpha$ between $a$ and $b$, we have

\[ c^2 = a^2 + b^2 - 2ab \cos\alpha. \]

In terms of the inner product, it follows for $u, v \in \mathbb{R}^n$, $u, v \neq 0$, where $\alpha$ denotes the angle between $u$ and $v$, that

\[ \|u\|^2 + \|v\|^2 - 2\|u\|\|v\|\cos\alpha = \|u - v\|^2 = (u - v, u - v) = (u, u) - 2(u, v) + (v, v). \]

In particular,

\[ \cos\alpha = \frac{(u, v)}{\|u\|\|v\|}. \]

If we apply this fact to $\alpha = \pm\pi/2$ (or $\pm 90$ degrees), we see that the orthogonality defined above coincides with the one we know from elementary geometry.
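The notions of norm, angle and orthogonality can be tried out numerically. The following is a small sketch for the standard inner product on $\mathbb{R}^3$; the function names and the sample vectors are chosen here for illustration only.

```python
import math


def inner(v, w):
    """Standard inner product on R^n."""
    return sum(a * b for a, b in zip(v, w))


def norm(v):
    """Length of v, defined through the inner product."""
    return math.sqrt(inner(v, v))


def angle(u, v):
    """Angle between two nonzero vectors, from cos(alpha) = (u,v)/(|u||v|)."""
    return math.acos(inner(u, v) / (norm(u) * norm(v)))


u = (1.0, 0.0, 0.0)
v = (1.0, 1.0, 0.0)
w = (0.0, 0.0, 2.0)

print(math.degrees(angle(u, v)))   # approximately 45
print(math.degrees(angle(u, w)))   # 90.0, since (u, w) = 0

# Cauchy-Schwarz inequality and Pythagorean theorem for these vectors.
print(abs(inner(u, v)) <= norm(u) * norm(v))
diff = tuple(a - b for a, b in zip(u, w))
print(norm(diff) ** 2, norm(u) ** 2 + norm(w) ** 2)
```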

An aspect of orthogonality in general vector spaces which might be more important for the theory is the following. Let $v_1, \ldots, v_n \in V$ be pairwise orthogonal vectors unequal to 0, and assume that

\[ \lambda_1 v_1 + \cdots + \lambda_n v_n = 0 \qquad (\lambda_1, \ldots, \lambda_n \in \mathbb{R}). \]

This implies that

\[ 0 = (v_i, \lambda_1 v_1 + \cdots + \lambda_n v_n) = \lambda_i (v_i, v_i), \]

and hence $\lambda_i = 0$. In particular, $v_1, \ldots, v_n$ are independent!

Definition 2.1.11. A basis $\{v_1, \ldots, v_n\}$ of $V$ is called orthogonal, if the vectors are pairwise orthogonal. If in addition $\|v_i\| = 1$ for all $i = 1, \ldots, n$, then the basis is called orthonormal.

If $\{v_1, \ldots, v_n\}$ is an orthonormal basis for $V$, the inner product with respect to the representation in terms of the basis can be expressed as follows.

\[ \Big( \sum_{i=1}^n \lambda_i v_i, \sum_{j=1}^n \mu_j v_j \Big) = \sum_{i,j=1}^n \lambda_i \overline{\mu_j} (v_i, v_j) = \sum_{i=1}^n \lambda_i \overline{\mu_i}. \tag{2.2} \]

In particular, by considering finite dimensional vector spaces with respect to an orthonormal basis, the associated inner product is the standard inner product with respect to this basis.

2.2 Linear operators and matrix algebra

Informally, a linear operator is a map from one vector space to another which preserves the linear structure. These maps are used to relate vector spaces (e.g. an n-dimensional vector space with $\mathbb{F}^n$), and to each system of linear equations, there is an associated linear operator.

Definition 2.2.1. Let $V, W$ be $\mathbb{F}$-vector spaces. A linear operator $L$ is a map $L : V \to W$ such that for all $v, w \in V$, $\lambda \in \mathbb{F}$:

\[ L(v + w) = L(v) + L(w), \qquad L(\lambda v) = \lambda L(v). \]

Example 12. Let $V$ be a finite dimensional vector space, and $\{e_1, \ldots, e_n\}$ an orthonormal basis of $V$. Then the i-th coordinate map, defined by

\[ \pi_i : V \to \mathbb{F}, \quad v \mapsto (e_i, v), \]

is a linear operator. In particular, for $v = \sum_i \lambda_i e_i$, it follows that

\[ \sum_{j=1}^n \pi_j(v) e_j = \sum_{j} \Big( e_j, \sum_i \lambda_i e_i \Big) e_j = \sum_{j} \lambda_j e_j. \]
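The coordinate maps can be tried out for a concrete orthonormal basis. The sketch below assumes $V = \mathbb{R}^2$ with the standard inner product and, as a hypothetical example, the standard basis rotated by 30 degrees. It checks that a vector is recovered from its coordinates, and that the inner product of two vectors equals the standard inner product of their coordinate vectors.

```python
import math


def inner(v, w):
    """Standard inner product on R^n."""
    return sum(a * b for a, b in zip(v, w))


# A hypothetical orthonormal basis of R^2: the standard basis rotated by 30 degrees.
t = math.radians(30)
e1 = (math.cos(t), math.sin(t))
e2 = (-math.sin(t), math.cos(t))
basis = [e1, e2]


def coordinates(v):
    """Apply the coordinate maps pi_i(v) = (e_i, v) for i = 1, 2."""
    return [inner(e, v) for e in basis]


def reconstruct(coords):
    """Form the linear combination sum_j coords[j] * e_j."""
    return tuple(sum(c * e[k] for c, e in zip(coords, basis)) for k in range(2))


v, w = (3.0, -1.0), (0.5, 2.0)
lam, mu = coordinates(v), coordinates(w)

print(reconstruct(lam))               # close to (3.0, -1.0)
print(inner(v, w), inner(lam, mu))    # both close to -0.5
```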


Definition 2.2.2. Let $L : V \to W$ be a linear operator. Then the set

\[ \ker(L) := \{ v \in V : L(v) = 0 \} \]

is called the kernel of $L$, and the set

\[ \operatorname{img}(L) := L(V) = \{ L(v) \in W : v \in V \} \]

is called the image of $L$.

As can easily be seen, kernel and image of a linear operator are subspaces of $V$ and $W$, respectively. Note that a linear operator $L$ is one-one if and only if $\ker(L) = \{0\}$. This can be seen from the following argument.

(i) Assume $L$ is one-one. Then $L(v) = 0$ implies $v = 0$ (since $L(0) = 0$).

(ii) Assume $\ker(L) = \{0\}$. Then $L(v) = L(w)$ implies $L(v) - L(w) = L(v - w) = 0$. Hence $v - w \in \ker(L) = \{0\}$, that is $v = w$.

In case the two associated vector spaces are finite dimensional, the dimensions of $\ker(L)$ and $\operatorname{img}(L)$ are finite (and well-defined). In particular, $\dim(\operatorname{img}(L))$ is called the rank of $L$.

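Example 13. Let $L : \mathbb{R}^2 \to \mathbb{R}^2$ be the linear operator given by

\[ \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} 4x + 2y \\ 2x + y \end{pmatrix}. \]

Then

\[
\ker(L) = \left\{ \begin{pmatrix} x \\ y \end{pmatrix} : 4x + 2y = 0,\ 2x + y = 0 \right\}
= \left\{ \begin{pmatrix} x \\ y \end{pmatrix} : \left( \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix} \right) = 0 \right\}.
\]

Set $w := \begin{pmatrix} 2 \\ 1 \end{pmatrix}$. We now decompose the vector $v \in \mathbb{R}^2$ as $v = \lambda w + v'$. Then $v' = v - \lambda w$, and

\[ (v', w) = (v, w) - \lambda (w, w) = (v, w) - \lambda \|w\|^2. \]

For $\lambda = (v, w)/\|w\|^2$, it follows that $(v', w) = 0$. [2] In particular, $v' \in \ker(L)$. Hence

\[
L(v) = L(\lambda w + v') = L(\lambda w) + L(v') = \lambda L(w) = \lambda \begin{pmatrix} 10 \\ 5 \end{pmatrix} = \frac{(v, w)}{5} \begin{pmatrix} 10 \\ 5 \end{pmatrix} = (v, w) \begin{pmatrix} 2 \\ 1 \end{pmatrix}.
\]

We have shown that

\[ \operatorname{img}(L) = \{ \lambda w : \lambda \in \mathbb{R} \}. \]

Note that $\dim(\operatorname{img}(L)) + \dim(\ker(L)) = 1 + 1 = 2 = \dim(V)$. The generalization of this observation holds for general linear operators of finite dimensional vector spaces.

[2] This is the same argument as in the proof of the Cauchy-Schwarz inequality.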
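The identity $L(v) = (v, w)\, w$ derived in Example 13 can be checked for a few sample vectors. The following is a short sketch with integer vectors chosen arbitrarily for illustration.

```python
def L(v):
    """The linear operator of Example 13: (x, y) -> (4x + 2y, 2x + y)."""
    x, y = v
    return (4 * x + 2 * y, 2 * x + y)


def inner(v, w):
    """Standard inner product on R^n."""
    return sum(a * b for a, b in zip(v, w))


w = (2, 1)

# A vector orthogonal to w lies in the kernel of L.
print(L((1, -2)))   # (0, 0)

# For arbitrary v, L(v) equals (v, w) * w, so the image is the line through w.
samples = [(1, 0), (0, 1), (3, -5), (-2, 7)]
for v in samples:
    s = inner(v, w)
    print(v, "->", L(v), "=", s, "* w")
```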


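Theorem 2.2.3 (Dimension formula for linear operators). Let $L : V \to W$ be a linear operator of finite dimensional vector spaces. Then

\[ \dim(\operatorname{img}(L)) + \dim(\ker(L)) = \dim(V). \]

Proof (ESM2B). Choose a basis $\{b_1, \ldots, b_k\}$ of $\ker(L)$, and extend this basis by $c_1, \ldots, c_l$ to a basis of $V$. Then the image of $v = \sum \lambda_i b_i + \sum \mu_j c_j$ is equal to

\[ L(v) = \sum \lambda_i L(b_i) + \sum \mu_j L(c_j) = \sum \mu_j L(c_j). \]

We now show that $\dim(\operatorname{img}(L)) = l$ by verifying that $\{L(c_1), \ldots, L(c_l)\}$ is a basis of $\operatorname{img}(L)$.

(i) $\operatorname{span}(L(c_1), \ldots, L(c_l)) = \operatorname{img}(L)$ by the above expression for $L(v)$.

(ii) Linear independence: Assume that $\sum \mu_j L(c_j) = 0$. Hence $L(\sum \mu_j c_j) = 0$, and in particular, $\sum \mu_j c_j \in \ker(L) = \operatorname{span}(b_1, \ldots, b_k)$. Since $\{b_1, \ldots, b_k, c_1, \ldots, c_l\}$ is linearly independent, this is only possible if $\mu_1 = \cdots = \mu_l = 0$.

Hence $\dim(\operatorname{img}(L)) = l$. Since $\dim(V) = k + l$, the assertion follows.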
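The dimension formula can be checked for a concrete operator on $\mathbb{R}^3$. The sketch below assumes that the operator is given by a matrix acting on column vectors, and uses row reduction to count the pivot columns (the rank) and to construct a basis of the kernel. The 3×3 matrix is a hypothetical example, and the helper names are chosen for illustration.

```python
from fractions import Fraction


def rref(matrix):
    """Reduced row echelon form and the list of pivot columns."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    pivots = []
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [x / rows[r][col] for x in rows[r]]      # pivot entry becomes 1
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
    return rows, pivots


def kernel_basis(matrix):
    """One basis vector of the kernel for every non-pivot (free) column."""
    rows, pivots = rref(matrix)
    n = len(matrix[0])
    basis = []
    for free in (c for c in range(n) if c not in pivots):
        vec = [Fraction(0)] * n
        vec[free] = Fraction(1)
        for r, p in enumerate(pivots):
            vec[p] = -rows[r][free]
        basis.append(vec)
    return basis


def apply(matrix, v):
    """Matrix times column vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


A = [[4, 2, 0],
     [2, 1, 0],
     [1, 0, 1]]            # hypothetical example, dim(V) = 3

_, pivots = rref(A)
kernel = kernel_basis(A)
rank = len(pivots)         # dim(img(L))
print(rank, len(kernel))   # 2 1
```

Here the rank is 2 and the kernel is spanned by a single vector, so the two dimensions add up to 3.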

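2.2.1 Matrices and linear operators

Let $L$ be a linear operator from $V$ to $W$ ($V, W$ finite dimensional). We now fix a basis $\{v_1, \ldots, v_n\}$ of $V$ and a basis $\{w_1, \ldots, w_m\}$ of $W$. Then, since $\{w_1, \ldots, w_m\}$ is a basis, there exist $a_{ij} \in \mathbb{F}$ ($1 \le i \le m$, $1 \le j \le n$) with

\[
\begin{aligned}
L(v_1) &= \sum_{i=1}^m a_{i1} w_i = a_{11} w_1 + \cdots + a_{m1} w_m\\
&\;\;\vdots\\
L(v_j) &= \sum_{i=1}^m a_{ij} w_i = a_{1j} w_1 + \cdots + a_{mj} w_m\\
&\;\;\vdots\\
L(v_n) &= \sum_{i=1}^m a_{in} w_i = a_{1n} w_1 + \cdots + a_{mn} w_m
\end{aligned}
\]

Since each element $v \in V$ has a unique representation $v = \lambda_1 v_1 + \cdots + \lambda_n v_n$, we obtain, using the linearity of $L$, that (using the symbol $\sum$ abbreviates the notation significantly)

\[
\begin{aligned}
L(v) &= L(\lambda_1 v_1 + \cdots + \lambda_n v_n) = \lambda_1 L(v_1) + \cdots + \lambda_n L(v_n)\\
&= \lambda_1 \Big( \sum_{i=1}^m a_{i1} w_i \Big) + \cdots + \lambda_n \Big( \sum_{i=1}^m a_{in} w_i \Big)\\
&= \sum_{j=1}^n \lambda_j \sum_{i=1}^m a_{ij} w_i = \sum_{i=1}^m \Big( \sum_{j=1}^n \lambda_j a_{ij} \Big) w_i\\
&= \Big( \sum_{j=1}^n \lambda_j a_{1j} \Big) w_1 + \cdots + \Big( \sum_{j=1}^n \lambda_j a_{mj} \Big) w_m.
\end{aligned}
\]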

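Hence, the linear operator $L$ is completely determined by the elements $(a_{ij})$. To keep the data in a more organized way, one defines the following product of an $m \times n$ matrix and a vector in $\mathbb{F}^n$:

\[
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & & & \vdots\\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}
\cdot
\begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \end{pmatrix}
:=
\begin{pmatrix} \sum_{j=1}^n a_{1j} \lambda_j \\ \sum_{j=1}^n a_{2j} \lambda_j \\ \vdots \\ \sum_{j=1}^n a_{mj} \lambda_j \end{pmatrix}
\]

By comparing this with the above expression for $L(v)$, one verifies immediately that the coefficients of $L(v)$ in terms of the basis $\{w_1, \ldots, w_m\}$ are given by the above product of the matrix and the vector.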
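The product of a matrix and a vector translates directly into code. The following sketch stores a matrix as a list of rows; the 2×3 matrix and the function name `mat_vec` are hypothetical choices for illustration.

```python
def mat_vec(a, lam):
    """Product of an m x n matrix (list of rows) and a vector with n entries."""
    return [sum(a_ij * l_j for a_ij, l_j in zip(row, lam)) for row in a]


# Hypothetical 2x3 matrix: an operator from a 3-dimensional space V
# to a 2-dimensional space W, written with respect to fixed bases.
A = [[1, 2, 0],
     [0, -1, 3]]

print(mat_vec(A, [1, 0, 0]))   # [1, 0]: the first column, i.e. the coefficients of L(v_1)
print(mat_vec(A, [2, 1, 1]))   # [4, 2]
```

Applying the matrix to the coordinate vector of the j-th basis vector returns the j-th column, which is why the columns contain the coefficients of the images of the basis vectors.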

Remark.

(i) A linear operator (of finite dimensional vector spaces) is uniquely determined by the images of the elements of a basis. The representations of these images in terms of the basis $\{w_i\}$ are given by the columns of the above matrix.

(ii) Hence a system of linear equations may be considered as an equation $L(v) = w$, for vectors $v \in V$ and $w \in W$. In particular, assume that $L(v) = w$. We then have that the set of all solutions of this equation is given by

\[ \{ v' : L(v') = w \} = \{ v + u : u \in \ker(L) \} = v + \ker(L). \]

Applying the dimension formula now gives

\[ \dim(\{ v' : L(v') = w \} - v) = \dim(\ker(L)) = \dim(V) - \dim(\operatorname{img}(L)). \]

Also note that sets of the form $v + U$, for a subspace $U$ of $V$, are called affine spaces. Geometrically, these spaces correspond to lines, planes etc. which do not necessarily contain the point 0.

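Example 14. Let $R$ be the rotation by 90 degrees counterclockwise in $\mathbb{R}^2$. Since

\[ R : \begin{pmatrix} 1 \\ 0 \end{pmatrix} \mapsto \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} 0 \\ 1 \end{pmatrix} \mapsto \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \]

we obtain by remark (i) that the associated matrix is

\[ A := \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}. \]

Moreover, let

\[ B := \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \]

By the same arguments as above, we obtain that $B$ is the matrix associated with the reflection about the axis in the $(1,0)$ direction.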
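The recipe "the columns of the matrix are the images of the basis vectors" can be written down directly. The sketch below uses hypothetical helper names and rebuilds the two matrices of Example 14 from the images of the standard basis vectors.

```python
def matrix_from_images(images):
    """Matrix whose j-th column is the image of the j-th standard basis vector."""
    rows = len(images[0])
    return [[img[i] for img in images] for i in range(rows)]


def mat_vec(a, v):
    """Matrix times column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


# Rotation by 90 degrees counterclockwise: e1 -> (0, 1), e2 -> (-1, 0).
A = matrix_from_images([(0, 1), (-1, 0)])
# Reflection which fixes e1 and maps e2 -> (0, -1).
B = matrix_from_images([(1, 0), (0, -1)])

print(A)                    # [[0, -1], [1, 0]]
print(B)                    # [[1, 0], [0, -1]]
print(mat_vec(A, [3, 2]))   # [-2, 3]
```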


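The above definition of the product of a matrix and a vector can easily be extended to the product of two matrices. So assume that

\[ A = (a_{ij} : 1 \le i \le l,\ 1 \le j \le m) \in M_{l\times m}(\mathbb{F}), \qquad B = (b_{jk} : 1 \le j \le m,\ 1 \le k \le n) \in M_{m\times n}(\mathbb{F}). \]

Then the matrix product of $A$ and $B$ is defined by

\[ A \cdot B := \Big( \sum_{j=1}^m a_{ij} b_{jk} : 1 \le i \le l,\ 1 \le k \le n \Big), \]

or equivalently,

\[
\begin{pmatrix} a_{11} & \cdots & a_{1m}\\ \vdots & & \vdots\\ a_{l1} & \cdots & a_{lm} \end{pmatrix}
\cdot
\begin{pmatrix} b_{11} & \cdots & b_{1n}\\ \vdots & & \vdots\\ b_{m1} & \cdots & b_{mn} \end{pmatrix}
:=
\begin{pmatrix} \sum_{j=1}^m a_{1j} b_{j1} & \cdots & \sum_{j=1}^m a_{1j} b_{jn}\\ \vdots & & \vdots\\ \sum_{j=1}^m a_{lj} b_{j1} & \cdots & \sum_{j=1}^m a_{lj} b_{jn} \end{pmatrix}
\]

In particular, the entry $(i,k)$ of $A \cdot B$ can be written as

\[ (a_{i1}, a_{i2}, \ldots, a_{im}) \cdot \begin{pmatrix} b_{1k} \\ b_{2k} \\ \vdots \\ b_{mk} \end{pmatrix} = \sum_{j=1}^m a_{ij} b_{jk}. \]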
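The "row times column" rule for the entry $(i,k)$ leads to a very short implementation. The sketch below again stores matrices as lists of rows; the two sample matrices are hypothetical.

```python
def mat_mul(a, b):
    """Product of an l x m matrix a and an m x n matrix b (lists of rows)."""
    if len(a[0]) != len(b):
        raise ValueError("number of columns of a must equal number of rows of b")
    columns_of_b = list(zip(*b))
    # Entry (i, k) is the product of row i of a with column k of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in columns_of_b]
            for row in a]


A = [[1, 2, 3],
     [4, 5, 6]]        # 2 x 3
B = [[1, 0],
     [0, 1],
     [2, -1]]          # 3 x 2

print(mat_mul(A, B))   # [[7, -1], [16, -1]]
print(mat_mul(B, A))   # a 3 x 3 matrix: the result depends on the order
```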

For the next example, we have to introduce the notion of the transpose of a matrix. So let $A = (a_{ij} : 1 \le i \le m,\ 1 \le j \le n)$ be an element of $M_{m\times n}(\mathbb{F})$. Then the matrix $B = (b_{ij} : 1 \le i \le n,\ 1 \le j \le m)$ is called the transpose of $A$, if

\[ b_{ij} = a_{ji}, \quad \text{for } 1 \le i \le n,\ 1 \le j \le m. \]

In this situation, we write $A^T := B$. Straightforward calculations now give in particular the following.

Proposition 2.2.4. Assume that $A$, $B$ are matrices such that $AB$ is defined. Then

\[ (AB)^T = B^T A^T. \]

Proof. Denote by $a_{ij}$, $b_{jk}$ the entries of the matrices $A$ and $B$, respectively. Then the entry $(i,k)$ of $AB$ is equal to the entry $(k,i)$ of $(AB)^T$, and by definition of the matrix multiplication equal to

\[ \sum_j a_{ij} b_{jk}. \]

Now denote by $a^T_{ij}$, $b^T_{jk}$ the entries of the matrices $A^T$ and $B^T$, respectively, that is $a^T_{ji} = a_{ij}$ and $b^T_{kj} = b_{jk}$. Since the entry $(k,i)$ of $B^T A^T$ is equal to

\[ \sum_j b^T_{kj} a^T_{ji} = \sum_j b_{jk} a_{ij} = \sum_j a_{ij} b_{jk}, \]

the assertion follows.
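Proposition 2.2.4 can be checked numerically for a pair of sample matrices. The following is a brief sketch with hypothetical matrices and helper names, again with matrices stored as lists of rows.

```python
def transpose(a):
    """Transpose: rows become columns."""
    return [list(col) for col in zip(*a)]


def mat_mul(a, b):
    """Matrix product, entry (i, k) = row i of a times column k of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[1, 0],
     [0, 1],
     [2, -1]]

left = transpose(mat_mul(A, B))                # (AB)^T
right = mat_mul(transpose(B), transpose(A))    # B^T A^T
print(left)            # [[7, 16], [-1, -1]]
print(left == right)   # True
```

Note that the product $A^T B^T$ in the original order is a 3×3 matrix here, so it cannot be equal to $(AB)^T$.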


Example 15 (Standard scalar product). Assume that $v, w \in \mathbb{R}^n$. Then $v^T$ and $w$ can be seen as elements of $M_{1\times n}(\mathbb{R})$ and $M_{n\times 1}(\mathbb{R})$, respectively. Hence $v^T \cdot w$ is well defined, and

\[ v^T w = (v, w) \in \mathbb{R}, \]

where $(v, w)$ refers to the standard scalar product. If one replaces $\mathbb{R}$ by $\mathbb{C}$ in this example, then we have to consider $v^* := \overline{v}^T$ in order to obtain that $v^* w$ is the standard scalar product.

Example 16. With $v, w$ as above, note that

\[ w v^T = \begin{pmatrix} w_1 v_1 & \cdots & w_1 v_n\\ \vdots & & \vdots\\ w_n v_1 & \cdots & w_n v_n \end{pmatrix} \in M_{n\times n}(\mathbb{F}). \]

2.2.2 Matrix algebraWe now will investigate the space of square matrices. That is, for fixed n, let A := Mn×n(F),and ei j : 1 ≤ i, j ≤ n be the set of elements of A defined as follows. The matrix ei j is thematrix with entries equal to zero, except for the entry (i, j), which is equal to one. By the samearguments as for the Fn, it follows that A is a vector space, and that ei j : 1 ≤ i, j ≤ n is abasis of A . In particular, we know how to add elements of A . Moreover, since A is the spaceof square matrices, multiplication from the left or right with another matrix is well-defined.These two operations now interact as follows. For A,B,C ∈A , we have

A(BC) = (AB)CA(B+C) = AB+AC(A+B)C = AC +BC.

0A = 0, A0 = 0.

Let 1 := e11 + e22 + · · ·enn, or in other words, 1 is the matrix with 1’s on the diagonal and 0’selsewhere. This matrix is called the identity matrix, since

1A = A, A1 = A.

Summarizing, elements in A can be added/subtracted precisely like elements in F (A Math-ematician will tell You, that (A ,+) and (F,+) are Abelian groups). Moreover, since it isalso possible to multiply elements, the question arises whether there are more analogies tomultiplication in F (A Mathematician will tell You, that A is a noncommutative algebra withidentity).

(i) Does $A \neq 0$, $B \neq 0$ imply that $A\cdot B \neq 0$? The answer is no:

\[
\begin{pmatrix} 1 & 0\\ 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 0 & 0\\ 0 & 1 \end{pmatrix} = 0.
\]

(ii) Does $AB = BA$ hold? The answer is again no! With $A,B$ referring to the rotation and reflection in Example 14, we have

\[
AB = \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}, \qquad BA = \begin{pmatrix} 0 & -1\\ -1 & 0 \end{pmatrix}.
\]

(iii) For $A \in \mathcal{A}$, does there exist an element $B \in \mathcal{A}$ such that

\[ AB = BA = 1? \]

The answer here is sometimes. If there exists $B \in \mathcal{A}$ with the above property, then $A$ is called invertible, and the inverse of $A$ is $A^{-1} := B$. By geometric considerations (or by explicit calculation), note that, with $A,B$ referring to the rotation and reflection in Example 14,

\[ A^4 = 1, \quad\text{and}\quad B^2 = 1. \]

Hence $A$ and $B$ are invertible, and the inverses are given by $A^{-1} = AAA = A^3$ and $B^{-1} = B$, respectively.

Moreover, it is worth noting that for invertible matrices $A_1,\dots,A_n$, we have

\[ (A_1\cdots A_n)^{-1} = A_n^{-1}\cdots A_1^{-1}, \]

and hence $A_1\cdots A_n$ is invertible (this can be checked by simplifying $A_1\cdots A_n\cdot A_n^{-1}\cdots A_1^{-1}$ and $A_n^{-1}\cdots A_1^{-1}\cdot A_1\cdots A_n$).
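The three answers above can be reproduced with a few lines of code. This is a sketch under an assumption: Example 14 is not restated here, so the matrices `A` (rotation by 90 degrees counterclockwise) and `B` (reflection in the first coordinate axis) are chosen as the pair that is consistent with the products $AB$ and $BA$ displayed above. The helper `matmul` is illustrative.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

ONE = [[1, 0], [0, 1]]
ZERO = [[0, 0], [0, 0]]

# (i) Two non-zero matrices whose product is zero.
P = [[1, 0], [0, 0]]
Q = [[0, 0], [0, 1]]
zero_product = matmul(P, Q)

# (ii) Rotation and reflection (assumed form of Example 14) do not commute.
A = [[0, -1], [1, 0]]
B = [[1, 0], [0, -1]]
AB = matmul(A, B)
BA = matmul(B, A)

# (iii) A^4 = 1 and B^2 = 1, so A^{-1} = A^3 and B^{-1} = B.
A3 = matmul(A, matmul(A, A))
A4 = matmul(A, A3)
B2 = matmul(B, B)

# The inverse of a product reverses the order: (AB)^{-1} = B^{-1} A^{-1}.
inverse_of_AB = matmul(B, A3)
print(zero_product, AB, BA, matmul(AB, inverse_of_AB))
```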

2.2.3 Solving systems of linear equations, and inverting matrices

Recall that a system of linear equations can be solved using operations on the rows. These operations can be recovered using elementary matrices. This relation is used to construct methods for solving systems of linear equations, determining the rank of a matrix, and determining the inverse of an invertible matrix.

Elementary matrices.

For n ∈ N, consider the following matrices in Mn×n(F).

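Type 1, $i \neq j$. "Adding $\lambda$ times row $j$ to row $i$". For $\lambda \neq 0$, let

\[ E := 1 + \lambda e_{ij}. \]

Recall that $e_{ij}$ has its only non-zero entry in row $i$ and column $j$, so the $i$-th row of $e_{ij}A$ is the $j$-th row of $A$ and all other rows of $e_{ij}A$ are zero. By direct calculation, it follows that $E^{-1} = 1 - \lambda e_{ij}$, and hence $E$ is invertible.

Type 2, $i = j$. "Multiplying row $i$ with $\lambda \neq 0$". Let

\[ E := 1 + (\lambda - 1)e_{ii}. \]

The inverse here is given explicitly by $E^{-1} = 1 + (1/\lambda - 1)e_{ii}$.

Type 3, $i \neq j$. "Interchange rows $i$ and $j$".

\[ E := 1 - e_{ii} - e_{jj} + e_{ij} + e_{ji}, \qquad E^{-1} = 1 - e_{ii} - e_{jj} + e_{ij} + e_{ji} = E. \]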

It is left as an exercise to verify that the matrices in fact correspond to row operations as claimed above. Here, we only give an example, which connects the Gaussian algorithm for solving systems of linear equations with multiplication by elementary matrices.
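The three types of elementary matrices can be built and tested directly. The sketch below is illustrative only: the function names `type1`, `type2`, `type3` are assumptions, indices are 0-based (so row 2 of the notes is index 1), and `type1(n, i, j, lam)` returns $1 + \lambda e_{ij}$, which adds $\lambda$ times row $j$ to row $i$ when multiplied from the left.

```python
from fractions import Fraction

def identity(n):
    return [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]

def matmul(A, B):
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

def type1(n, i, j, lam):
    """1 + lam * e_ij: left multiplication adds lam times row j to row i."""
    E = identity(n)
    E[i][j] += lam
    return E

def type2(n, i, lam):
    """1 + (lam - 1) * e_ii: left multiplication scales row i by lam."""
    E = identity(n)
    E[i][i] = Fraction(lam)
    return E

def type3(n, i, j):
    """1 - e_ii - e_jj + e_ij + e_ji: left multiplication swaps rows i and j."""
    E = identity(n)
    E[i], E[j] = E[j], E[i]
    return E

A = [[1, 2, 1], [1, 2, -1], [2, 3, -3]]
added = matmul(type1(3, 1, 0, -1), A)             # second row minus first row
scaled = matmul(type2(3, 2, Fraction(-1, 2)), A)  # third row times -1/2
swapped = matmul(type3(3, 1, 2), A)               # swap second and third row
print(added, scaled, swapped)
```

The claimed inverses can be checked in the same way, for example `matmul(type1(3, 1, 0, -1), type1(3, 1, 0, 1))` should give the identity matrix.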

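Example 17. We now consider the following example of a system of linear equations, or equivalently the equation defined by matrix multiplication on the right hand side.

\[
\begin{aligned} x + 2y + z &= 1\\ x + 2y - z &= 2\\ 2x + 3y - 3z &= 1 \end{aligned}
\qquad\qquad
\begin{pmatrix} 1 & 2 & 1\\ 1 & 2 & -1\\ 2 & 3 & -3 \end{pmatrix}
\begin{pmatrix} x\\ y\\ z \end{pmatrix} = \begin{pmatrix} 1\\ 2\\ 1 \end{pmatrix}
\]

The Gaussian algorithm now tells us that one should subtract row 1 from row 2:

\[
\begin{aligned} x + 2y + z &= 1\\ -2z &= 1\\ 2x + 3y - 3z &= 1 \end{aligned}
\]

In terms of elementary matrices, this reads as

\[
\begin{pmatrix} 1 & 0 & 0\\ -1 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 1\\ 1 & 2 & -1\\ 2 & 3 & -3 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 1\\ 0 & 0 & -2\\ 2 & 3 & -3 \end{pmatrix},
\qquad
\begin{pmatrix} 1 & 0 & 0\\ -1 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1\\ 2\\ 1 \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ 1 \end{pmatrix}.
\]

In the sequel, we only specify the elementary matrix which has to be multiplied from the left. Now subtract two times row 1 from row 3:

\[
\begin{aligned} x + 2y + z &= 1\\ -2z &= 1\\ -y - 5z &= -1 \end{aligned}
\qquad\qquad
\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ -2 & 0 & 1 \end{pmatrix}.
\]

Now interchange rows 2 and 3:

\[
\begin{aligned} x + 2y + z &= 1\\ -y - 5z &= -1\\ -2z &= 1 \end{aligned}
\qquad\qquad
\begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix}.
\]

Multiply rows 2 and 3 by $-1$ and $-\tfrac12$, respectively:

\[
\begin{aligned} x + 2y + z &= 1\\ y + 5z &= 1\\ z &= -\tfrac12 \end{aligned}
\qquad\qquad
\begin{pmatrix} 1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1 \end{pmatrix},
\quad
\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & -\tfrac12 \end{pmatrix}.
\]

Subtract 5 times row 3 from row 2:

\[
\begin{aligned} x + 2y + z &= 1\\ y &= \tfrac72\\ z &= -\tfrac12 \end{aligned}
\qquad\qquad
\begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & -5\\ 0 & 0 & 1 \end{pmatrix}.
\]

Subtract row 3 from row 1, and 2 times row 2 from row 1:

\[
\begin{aligned} x &= -\tfrac{11}{2}\\ y &= \tfrac72\\ z &= -\tfrac12 \end{aligned}
\qquad\qquad
\begin{pmatrix} 1 & 0 & -1\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix},
\quad
\begin{pmatrix} 1 & -2 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}.
\]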

Using row operations we managed to transform the matrix into the so-called row echelon form. However, we could also achieve this by successive multiplication with invertible matrices from the left!
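The elimination in Example 17 can be replayed as a product of elementary matrices. The following sketch is an illustration, not part of the notes: the helper names are assumptions, indices are 0-based, and `add_row(n, i, j, lam)` denotes the type 1 matrix $1 + \lambda e_{ij}$ that adds $\lambda$ times row $j$ to row $i$.

```python
from fractions import Fraction

def identity(n):
    return [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]

def matmul(A, B):
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

def add_row(n, i, j, lam):      # type 1
    E = identity(n)
    E[i][j] += lam
    return E

def scale_row(n, i, lam):       # type 2
    E = identity(n)
    E[i][i] = Fraction(lam)
    return E

def swap_rows(n, i, j):         # type 3
    E = identity(n)
    E[i], E[j] = E[j], E[i]
    return E

A = [[1, 2, 1], [1, 2, -1], [2, 3, -3]]
rhs = [[1], [2], [1]]

# The row operations of Example 17, in the order in which they are applied.
steps = [
    add_row(3, 1, 0, -1),               # subtract row 1 from row 2
    add_row(3, 2, 0, -2),               # subtract 2 times row 1 from row 3
    swap_rows(3, 1, 2),                 # interchange rows 2 and 3
    scale_row(3, 1, -1),                # multiply row 2 by -1
    scale_row(3, 2, Fraction(-1, 2)),   # multiply row 3 by -1/2
    add_row(3, 1, 2, -5),               # subtract 5 times row 3 from row 2
    add_row(3, 0, 2, -1),               # subtract row 3 from row 1
    add_row(3, 0, 1, -2),               # subtract 2 times row 2 from row 1
]

M, solution = A, rhs
for E in steps:
    M = matmul(E, M)
    solution = matmul(E, solution)

print(M)          # the identity matrix
print(solution)   # the column (x, y, z)
```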

Row echelon form

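We now generalize the above example to arbitrary matrices. Therefore, we have to introduce the notion of the row echelon form (which unfortunately is not defined consistently in the literature).

Definition 2.2.5. The matrix $A \in M_{m\times n}(\mathbb{F})$ is said to be in row echelon form (REF), if

\[
A = \begin{pmatrix}
1 & *\cdots* & 0 & *\cdots* & 0 & *\cdots & \cdots\\
  &          & 1 & *\cdots* & 0 & *\cdots & \cdots\\
  &          &   &          & 1 & *\cdots & \cdots\\
  &          &   &          &   &         & \ddots
\end{pmatrix},
\]

where all entries left blank are zero. That is, each non-zero row starts with a leading 1, each leading 1 lies strictly to the right of the leading 1 of the row above, all other entries in the column of a leading 1 are zero, and the zero rows are at the bottom. As the second matrix in Example 18 shows, the first leading 1 may be preceded by zero columns.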

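Example 18. The following matrices are in REF (here, $*$ stands for arbitrary elements of $\mathbb{F}$):

\[
\begin{pmatrix} 1 & *\\ 0 & 0 \end{pmatrix},\quad
\begin{pmatrix} 0 & 1\\ 0 & 0 \end{pmatrix},\quad
\begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix},\quad
\begin{pmatrix} 1 & * & 0\\ 0 & 0 & 1 \end{pmatrix}.
\]

The following matrices are not in REF:

\[
\begin{pmatrix} 1 & 1\\ 0 & 1 \end{pmatrix},\quad
\begin{pmatrix} 1 & 0\\ 0 & 3 \end{pmatrix},\quad
\begin{pmatrix} 1 & * & 0\\ 0 & 0 & 0\\ 0 & 0 & 1 \end{pmatrix}.
\]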

Using the methods of Example 17, we obtain the following result. It is obvious but important, since a lot of properties of a matrix can be read off from its REF.

Theorem 2.2.6. For each $A \in M_{m\times n}(\mathbb{F})$, there exist elementary matrices $E_1,E_2,\dots,E_k \in M_{m\times m}(\mathbb{F})$ such that

\[ E_k\cdots E_2\cdot E_1 A \]

is in REF.

Proof. Assume without loss of generality that $A \neq 0$ (since $0$ is in REF).

Step 1. If there is no non-zero entry in the first column, check whether there is a non-zero entry in the second column. If not, then proceed until there is $1 \le k \le n$ such that the $k$-th column contains a non-zero element. Using a type 3 matrix $E_1$, one obtains that the entry $(1,k)$ of $E_1A$ is non-zero. Then use type 1 matrices $E_2,\dots,E_{l_1}$ such that the entries $(2,k),\dots,(m,k)$ of

\[ E_{l_1}\cdots E_2E_1A \]

are all equal to zero. Hence,

\[
E_{l_1}\cdots E_2E_1A = \begin{pmatrix}
0\cdots0 & * & *\cdots\\
0\cdots0 & 0 & *\cdots\\
\vdots & \vdots & \vdots\\
0\cdots0 & 0 & *\cdots
\end{pmatrix}.
\]

By iterating this procedure on the remaining rows, one arrives at elementary matrices $E_1,\dots,E_l$ such that

\[
E_l\cdots E_2E_1A = \begin{pmatrix}
0\cdots0 & *\cdots* & *\cdots*\\
0\cdots0 & 0\cdots0 & *\cdots*\\
0\cdots0 & 0\cdots0 & 0\cdots0\\
\vdots & \vdots & \vdots\ \ddots
\end{pmatrix},
\]

where the first entry of each block $*\cdots*$ that starts a row is non-zero.

Step 2. By choosing suitable type 2 matrices $E_{l+1},\dots,E_{l'}$ we obtain that

\[
E_{l'}\cdots E_2E_1A = \begin{pmatrix}
0\cdots0 & 1 & *\cdots* & * & *\cdots*\\
0\cdots0 & 0 & 0\cdots0 & 1 & *\cdots*\\
0\cdots0 & 0 & 0\cdots0 & 0 & 0\cdots0\\
\vdots & \vdots & \vdots & \vdots & \vdots\ \ddots
\end{pmatrix}.
\]

Step 3. Now, by applying again type 1 matrices, it is possible to eliminate the entries above the 1's. That is, there exist $E_{l'+1},\dots,E_k$ such that

\[
E_k\cdots E_2E_1A = \begin{pmatrix}
0\cdots0 & 1 & *\cdots* & 0 & *\cdots*\\
0\cdots0 & 0 & 0\cdots0 & 1 & *\cdots*\\
0\cdots0 & 0 & 0\cdots0 & 0 & 0\cdots0\\
\vdots & \vdots & \vdots & \vdots & \vdots\ \ddots
\end{pmatrix}.
\]

This proves the assertion.
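The proof of Theorem 2.2.6 is constructive and translates into a short program. The sketch below is one possible arrangement, not the author's: it treats the columns from left to right and performs the type 3, type 2 and type 1 operations for one leading 1 before moving on, instead of running through Steps 1 to 3 separately. The function name `rref` and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction

def rref(rows):
    """Return (R, pivots): the REF of the matrix in the strict sense of
    Definition 2.2.5, and the 0-based columns of the leading 1's."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    pivots = []
    r = 0                                   # row that receives the next leading 1
    for c in range(n):
        if r == m:
            break
        # Look for a non-zero entry in column c, at or below row r.
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue                        # nothing to do in this column
        M[r], M[p] = M[p], M[r]             # type 3: interchange rows
        lead = M[r][c]
        M[r] = [x / lead for x in M[r]]     # type 2: make the leading entry 1
        for i in range(m):                  # type 1: clear the rest of column c
            if i != r and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    return M, pivots

R, pivots = rref([[1, 2, 1], [1, 2, -1], [2, 3, -3]])
print(R, pivots)
```

For the coefficient matrix of Example 17 the result is the identity matrix, in agreement with the calculation done by hand.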

Note that the product AB of two invertible matrices A,B ∈Mn×n is again invertible, since

ABB−1A−1 = 1 = B−1A−1AB.

In particular, it follows by induction that $E_k\cdots E_2\cdot E_1$ is invertible. For invertible matrices, the following holds.

Lemma 2.2.7. Assume that $A \in M_{n\times n}(\mathbb{F})$ is invertible. Then

(i) $A$ is one-to-one (or injective): if $Av = Aw$ for $v,w \in \mathbb{F}^n$, then $v = w$.

(ii) $A$ is onto (or surjective): for each $w \in \mathbb{F}^n$, there exists $v \in \mathbb{F}^n$ such that $Av = w$, or equivalently, $\operatorname{img}(A) = \mathbb{F}^n$.

Proof. In order to prove (i), note that $Av = Aw$ implies that $A(v-w) = 0$. Hence $0 = A^{-1}(A(v-w)) = (A^{-1}A)(v-w) = v-w$. For (ii), set $v := A^{-1}w$; then $Av = AA^{-1}w = w$.

Application of the REF - rank and image of a linear operator

The rank of a linear operator L : V →W is defined as the dimension of img(L) = L(V ), that is

rank(L) := dim(img(L)).

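Using the REF it is now possible to determine the rank explicitly. Recall that, for given bases $\{v_1,\dots,v_n\}$ of $V$ and $\{w_1,\dots,w_m\}$ of $W$, respectively, the linear operator is represented by an associated matrix $A \in M_{m\times n}(\mathbb{F})$:

\[
\begin{array}{ccc}
\sum_i \lambda_i v_i & \xrightarrow{\ L\ } & \sum_{i,j} a_{ij}\lambda_j w_i\\[2pt]
\big\downarrow & & \big\downarrow\\[2pt]
(\lambda_1,\dots,\lambda_n)^T & \xrightarrow{\ A\ } & A(\lambda_1,\dots,\lambda_n)^T
\end{array}
\]

By applying Theorem 2.2.6, there exist $E_1,E_2,\dots,E_k \in M_{m\times m}(\mathbb{F})$ such that $E_k\cdot E_{k-1}\cdots E_1\cdot A$ is in REF. The operators are organized as follows, where the composition of $A$ with $E_k\cdots E_1$ is the diagonal map $E_k\cdots E_1\cdot A : \mathbb{F}^n \to \mathbb{F}^m$.

\[
\begin{array}{ccc}
V & \xrightarrow{\ L\ } & W\\[2pt]
\big\downarrow & & \big\downarrow\\[2pt]
\mathbb{F}^n & \xrightarrow{\ A\ } & \mathbb{F}^m\\[2pt]
 & {\scriptstyle E_k\cdots E_1\cdot A}\searrow & \big\downarrow{\scriptstyle E_k\cdots E_1}\\[2pt]
 & & \mathbb{F}^m
\end{array}
\]

For general matrices, we have that the image of the matrix (which is a subspace, see Definition 2.2.2) is generated by the column vectors of the matrix. Since $E_k\cdots E_1A$ is in REF, the dimension of the image $\operatorname{img}(E_k\cdots E_1A)$ has to be the number of rows of the REF which are not identical to zero. In particular, since $E_k\cdots E_1$ is an invertible linear operator,

\[ \dim(\operatorname{img}(E_k\cdots E_1A)) = \dim(\operatorname{img}(A)). \]

Hence,

\[ \operatorname{rank}(L) = \operatorname{rank}(A) = \#\,\text{rows of the REF} \neq (0,\dots,0). \]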

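Remark (ESM2B). Moreover, it is possible to determine a basis of the image of $A$ (or equivalently, of $L$). Denote the entries of $E_k\cdots E_1A$ by $(b_{ij})$, and let

\[ k_i := \min\{ j : b_{ij} = 1 \} \]

for $i = 1,\dots,\operatorname{rank}(A)$ (so $k_1,\dots,k_{\operatorname{rank}(A)}$ are the columns where the REF "jumps"). By direct calculation, it follows that (here, $e_i$ denotes the standard basis, on the left hand side for $\mathbb{F}^m$, and on the right hand side for $\mathbb{F}^n$!)

\[ e_i = E_k\cdots E_1 A e_{k_i}. \]

Note that $e_1,\dots,e_{\operatorname{rank}(A)}$ are linearly independent (in $\mathbb{F}^m$). Hence, since $E_k\cdots E_1$ is invertible,

\[
\{E_1^{-1}\cdots E_k^{-1}e_1,\ \dots,\ E_1^{-1}\cdots E_k^{-1}e_{\operatorname{rank}(A)}\} = \{Ae_{k_1},\ \dots,\ Ae_{k_{\operatorname{rank}(A)}}\}
\]

is linearly independent. Since these are $\operatorname{rank}(A) = \dim(\operatorname{img}(A))$ elements of $\operatorname{img}(A)$, the set $\{Ae_{k_1},\dots,Ae_{k_{\operatorname{rank}(A)}}\}$ is a basis of $\operatorname{img}(A)$. In other words, the columns $k_1,\dots,k_{\operatorname{rank}(A)}$ of $A$ are a basis of the image.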

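Example 19. We now apply the theory of matrices and REF to Example 13. The matrix $A$ of the linear operator $L$ is given by

\[ A := \begin{pmatrix} 4 & 2\\ 2 & 1 \end{pmatrix}. \]

We now determine the REF:

\[
\begin{pmatrix} 4 & 2\\ 2 & 1 \end{pmatrix} \to
\begin{pmatrix} 2 & 1\\ 2 & 1 \end{pmatrix} \to
\begin{pmatrix} 2 & 1\\ 0 & 0 \end{pmatrix} \to
\begin{pmatrix} 1 & \tfrac12\\ 0 & 0 \end{pmatrix}.
\]

Hence the rank of $A$ is equal to 1. In particular, $k_1$ as defined above is equal to 1. Hence

\[
Ae_1 = \begin{pmatrix} 4 & 2\\ 2 & 1 \end{pmatrix}\begin{pmatrix} 1\\ 0 \end{pmatrix} = \begin{pmatrix} 4\\ 2 \end{pmatrix}
\]

is a basis for $\operatorname{img}(A) = \operatorname{img}(L)$. Moreover, the kernel of $A$ is the set of solutions of $Av = 0$ (solving systems of linear equations is the subject of the next section).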
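The rank and a basis of the image can be read off from the REF by a program as well. The following sketch reuses a row reduction routine; the names `rref`, `rank` and `image_basis` are assumptions of this illustration, and columns are numbered from 0 in the code.

```python
from fractions import Fraction

def rref(rows):
    """REF in the strict sense of the notes, plus the 0-based pivot columns."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    pivots, r = [], 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        lead = M[r][c]
        M[r] = [x / lead for x in M[r]]
        for i in range(m):
            if i != r and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    return M, pivots

def rank(A):
    """Number of non-zero rows of the REF, i.e. the number of leading 1's."""
    return len(rref(A)[1])

def image_basis(A):
    """The columns k_1, ..., k_rank(A) of A itself (not of its REF)."""
    _, pivots = rref(A)
    return [[row[k] for row in A] for k in pivots]

A = [[4, 2], [2, 1]]
print(rank(A), image_basis(A))
```

Note that the basis consists of columns of the original matrix $A$; the corresponding columns of the REF are standard basis vectors and in general do not lie in the image of $A$.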

Application of the REF - solving systems of linear equations

Before applying the REF to solve systems of linear equations, we will have a look at the problem from a more general point of view, that is, we look at the following problem.

Let $L : V \to W$ be a linear operator, and $w \in W$ be given. We now want to determine the set

\[ \{x \in V : L(x) = w\}, \]

or in other words, find all $x \in V$ with $L(x) = w$.

In order to proceed, we have to distinguish between the following two cases. If $w = 0$, then the problem is called homogeneous, and if $w \neq 0$, then it is called inhomogeneous. Note that

(i) $L(x) = w$ has a solution if and only if $w \in \operatorname{img}(L) = L(V)$,

(ii) in the homogeneous case, $0$ is a solution.

From this general point of view, one proceeds as follows.

Step 1. Decide whether a solution exists. In case that there are no solutions, the problem is solved. If a solution exists, find $x_0 \in V$ with $L(x_0) = w$.

Step 2. Determine $\ker(L)$ (or in other words, solve the associated homogeneous problem $L(x) = 0$). The set of solutions is then given by

\[ x_0 + \ker(L) = \{x_0 + v : v \in V, L(v) = 0\}. \]

Proof (ESM2B). Choose $x \in V$ with $L(x) = w$. Then $L(x - x_0) = L(x) - L(x_0) = w - w = 0$. Hence $x - x_0 \in \ker(L)$, and in particular,

\[ x \in \{x_0 + v : v \in V, L(v) = 0\}. \]

This implies that $\{x : L(x) = w\} \subset x_0 + \ker(L)$. The reverse inclusion follows, for $v \in \ker(L)$, from $L(x_0 + v) = w + 0 = w$.

For the case of finite dimensional vector spaces, the problem transforms by choosing bases for $V$ and $W$ into the following.

Let $A \in M_{m\times n}(\mathbb{F})$, and $w \in \mathbb{F}^m$. Then, for $x \in \mathbb{F}^n$,

\[ Ax = w \]

is called a system of linear equations. We now want to determine the set

\[ \{x \in \mathbb{F}^n : Ax = w\}, \]

or in other words, find all $x \in \mathbb{F}^n$ with $Ax = w$.

To apply the above considerations, we have to introduce the so-called augmented matrix $(A|w)$. That is, $(A|w) \in M_{m\times(n+1)}(\mathbb{F})$ is the matrix whose first $n$ columns are the columns of $A$, and whose $(n+1)$-th column is equal to $w$. We now obtain the following algorithm.

Algorithm 1 (Solving a system of linear equations).

Step 0. Transform the left part $A$ of $(A|w)$ into REF by row operations on $(A|w)$. We hence arrive at an augmented matrix

\[
\left(\begin{array}{ccccc|c}
0\cdots0 & 1 & *\cdots* & 0 & \cdots & w'_1\\
 & & & 1 & \cdots & w'_2\\
 & & & & \ddots & \vdots
\end{array}\right).
\]

Step 1. Using the REF we are now in a position to decide whether there exists a solution or not:

If $w'_{r+1} = \cdots = w'_m = 0$, then the set of solutions is nonempty. Here, $r := \operatorname{rank}(A)$ (= number of non-zero rows in the REF of $A$). Or equivalently, if there exists a row where only the $(n+1)$-th entry is not equal to zero, then there is no solution.

So assume that $w'_{r+1} = \cdots = w'_m = 0$, and let $k_1,k_2,\dots,k_r$ be defined as above (that is, $k_i$ is the column of the leading 1 in row $i$ of the REF, as in the remark on the basis of the image). Then

\[ x_0 = \sum_{i=1}^{r} w'_i e_{k_i} \]

is a solution.

Step 2. The advantage of the REF (in the strict sense defined in the course) is that a basis of the kernel can be read off directly. Since we want to determine the solutions of $Ax = 0$, we have to consider the above REF with $w'_1 = \cdots = w'_m = 0$. For $x = (x_1,\dots,x_n)^T$, note that each row of the matrix reads as follows.

(i) If $i > r$, we then have $0 = 0$.

(ii) If $i \le r$, we then have

\[ x_{k_i} + *\cdot x_{k_i+1} + \cdots + *\cdot x_{k_{i+1}-1} + 0\cdot x_{k_{i+1}} + *\cdot x_{k_{i+1}+1} + \cdots = 0. \]

Using this observation, we arrive at the following system of linear equations (extended by trivial equations):

\[
\begin{aligned}
x_1 &= x_1\\
&\ \ \vdots\\
x_{k_1} &= -*\cdot x_{k_1+1} - \cdots - *\cdot x_{k_2-1} - 0\cdot x_{k_2} - *\cdot x_{k_2+1} - \cdots\\
x_{k_1+1} &= x_{k_1+1}\\
&\ \ \vdots
\end{aligned}
\]

By writing the right hand side of this system of linear equations as a matrix, we obtain that the column vectors of this matrix generate $\ker(A)$, and by dropping the vectors which are equal to $0$, we obtain a basis of $\ker(A)$ (this follows e.g. from the dimension formula).

Step 3. If x0 exists, then the set of all solutions of the system of linear equations is given by

x0 +ker(A).

We now illustrate this procedure in an example.

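Example 20. Assume that the REF of $(A|w)$ is given by, for $a,b,c \in \mathbb{R}$,

\[
\left(\begin{array}{ccccccc|c}
0 & 1 & 2 & 5 & 0 & 1 & 1 & a\\
0 & 0 & 0 & 0 & 1 & 3 & 2 & b\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & c
\end{array}\right).
\]

Then $\operatorname{rank}(A) = 2$, and $k_1 = 2$, $k_2 = 5$. Moreover, a solution of $Ax = w$ exists if and only if $c = 0$. So assume that $c = 0$. Then

\[ x_0 = ae_2 + be_5 = (0,a,0,0,b,0,0)^T. \]

The extended system of linear equations now is given by the $7\times7$-matrix

\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & -2 & -5 & 0 & -1 & -1\\
0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & -3 & -2\\
0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.
\]

Here, the second and the fifth row are the rows given by the REF, and the second and the fifth column are the two zero columns. Hence,

\[
\begin{pmatrix}1\\0\\0\\0\\0\\0\\0\end{pmatrix},\
\begin{pmatrix}0\\-2\\1\\0\\0\\0\\0\end{pmatrix},\
\begin{pmatrix}0\\-5\\0\\1\\0\\0\\0\end{pmatrix},\
\begin{pmatrix}0\\-1\\0\\0\\-3\\1\\0\end{pmatrix},\
\begin{pmatrix}0\\-1\\0\\0\\-2\\0\\1\end{pmatrix}
\]

is a basis of $\ker(A)$. The set of all solutions is hence given by

\[
\left\{
\begin{pmatrix}0\\a\\0\\0\\b\\0\\0\end{pmatrix}
+\lambda_1\begin{pmatrix}1\\0\\0\\0\\0\\0\\0\end{pmatrix}
+\lambda_2\begin{pmatrix}0\\-2\\1\\0\\0\\0\\0\end{pmatrix}
+\lambda_3\begin{pmatrix}0\\-5\\0\\1\\0\\0\\0\end{pmatrix}
+\lambda_4\begin{pmatrix}0\\-1\\0\\0\\-3\\1\\0\end{pmatrix}
+\lambda_5\begin{pmatrix}0\\-1\\0\\0\\-2\\0\\1\end{pmatrix}
: \lambda_1,\dots,\lambda_5 \in \mathbb{R}
\right\}.
\]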
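Algorithm 1 can be sketched in code as follows. This is an illustration under stated assumptions: the names `rref` and `solve` are not from the notes, columns are numbered from 0, and the values $a = 2$, $b = 5$, $c = 0$ are picked only to have concrete numbers for the matrix of Example 20. The kernel basis is produced by setting one free variable to 1 and reading off the entries at the leading columns with a minus sign, which gives the same vectors as the non-zero columns of the extended system.

```python
from fractions import Fraction

def rref(rows):
    """REF in the strict sense of the notes, plus the 0-based pivot columns."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    pivots, r = [], 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        lead = M[r][c]
        M[r] = [x / lead for x in M[r]]
        for i in range(m):
            if i != r and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    return M, pivots

def solve(A, w):
    """Return (x0, kernel_basis) for Ax = w, or None if no solution exists."""
    n = len(A[0])
    R, pivots = rref([list(row) + [wi] for row, wi in zip(A, w)])
    if n in pivots:                  # a row (0, ..., 0 | non-zero entry)
        return None
    x0 = [Fraction(0)] * n
    for i, k in enumerate(pivots):
        x0[k] = R[i][n]              # x0 = sum_i w'_i e_{k_i}
    kernel = []
    for f in range(n):               # one basis vector per free column
        if f in pivots:
            continue
        v = [Fraction(0)] * n
        v[f] = Fraction(1)
        for i, k in enumerate(pivots):
            v[k] = -R[i][f]
        kernel.append(v)
    return x0, kernel

A = [[0, 1, 2, 5, 0, 1, 1],
     [0, 0, 0, 0, 1, 3, 2],
     [0, 0, 0, 0, 0, 0, 0]]
x0, kernel = solve(A, [2, 5, 0])     # a = 2, b = 5, c = 0 (sample values)
print(x0)
print(kernel)
```

With $c \neq 0$, for instance `solve(A, [2, 5, 1])`, the function returns `None`, in line with the criterion of Step 1.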

Application of the REF - the inverse of a matrix

A further non-trivial application of the REF is related to invertible matrices.

Corollary 2.2.8. Assume that $A,B \in M_{n\times n}(\mathbb{F})$ are such that $AB = 1$. Then $A$ and $B$ are invertible, and in particular, $A^{-1} = B$, $B^{-1} = A$.

Sketch of proof. Assume that $AB = 1$, and let $E_1,\dots,E_k$ be such that $E_k\cdots E_1A$ is in REF. Hence

\[ E_k\cdots E_1AB = E_k\cdots E_1. \]

So assume that $\operatorname{rank}(A) \neq n$. Then the bottom row of $E_k\cdots E_1A$ has to be zero. Hence, also the bottom rows of $E_k\cdots E_1AB$ and $E_k\cdots E_1$ have to be zero. But this is a contradiction to $\operatorname{rank}(E_k\cdots E_1) = n$ (since $E_k\cdots E_1$ is invertible).

So $\operatorname{rank}(A) = n$, hence $E_k\cdots E_1A = 1$, and therefore $B = E_k\cdots E_1AB = E_k\cdots E_1$. So $BA = E_k\cdots E_1A = 1$, which proves that $A^{-1} = B$.

As an immediate consequence of the above Corollary, we obtain an algorithm to find out whether a matrix is invertible, and to determine the inverse. So let $A \in M_{n\times n}(\mathbb{F})$.

Algorithm 2 (Inversion of a matrix).

(i) Transform the matrix $(A|1)$ into REF.

(ii) If the REF is of the form $(1|B)$, then $A$ is invertible, and $A^{-1} = B$.

(iii) If the REF is not of the form $(1|B)$, then $A$ is not invertible.

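Example 21 (Inversion of $2\times2$-matrices). We now determine the inverse of a general matrix in $M_{2\times2}(\mathbb{F})$, where we have to assume that $a \neq 0$ and $ad - bc \neq 0$.

\[
\left(\begin{array}{cc|cc} a & b & 1 & 0\\ c & d & 0 & 1 \end{array}\right)
\to
\left(\begin{array}{cc|cc} a & b & 1 & 0\\ 0 & d - \frac{bc}{a} & -\frac{c}{a} & 1 \end{array}\right)
\to
\left(\begin{array}{cc|cc} a & b & 1 & 0\\ 0 & 1 & -\frac{c}{ad-bc} & \frac{a}{ad-bc} \end{array}\right)
\]

\[
\to
\left(\begin{array}{cc|cc} 1 & \frac{b}{a} & \frac{1}{a} & 0\\ 0 & 1 & -\frac{c}{ad-bc} & \frac{a}{ad-bc} \end{array}\right)
\to
\left(\begin{array}{cc|cc} 1 & 0 & \frac{d}{ad-bc} & \frac{-b}{ad-bc}\\ 0 & 1 & -\frac{c}{ad-bc} & \frac{a}{ad-bc} \end{array}\right)
\]

If $a = 0$, the calculations are less complicated, but the outcome is the same (left as an exercise). Hence we have shown that a matrix in $M_{2\times2}(\mathbb{F})$ is invertible if and only if $ad - bc \neq 0$, and that

\[
\begin{pmatrix} a & b\\ c & d \end{pmatrix}^{-1} = \frac{1}{ad-bc}\begin{pmatrix} d & -b\\ -c & a \end{pmatrix}.
\]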
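Algorithm 2 can be sketched as follows: append the identity matrix, row reduce, and read off the right half. The function names `rref` and `inverse` are assumptions of this illustration; exact fractions are used so that the result can be compared with the $2\times2$ formula.

```python
from fractions import Fraction

def rref(rows):
    """REF in the strict sense of the notes, plus the 0-based pivot columns."""
    M = [[Fraction(x) for x in row] for row in rows]
    m, n = len(M), len(M[0])
    pivots, r = [], 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        lead = M[r][c]
        M[r] = [x / lead for x in M[r]]
        for i in range(m):
            if i != r and M[i][c] != 0:
                f = M[i][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    return M, pivots

def inverse(A):
    """Return the inverse of the square matrix A, or None if A is singular."""
    n = len(A)
    augmented = [list(row) + [int(i == j) for j in range(n)]
                 for i, row in enumerate(A)]          # the matrix (A|1)
    R, pivots = rref(augmented)
    if pivots != list(range(n)):                      # REF is not of the form (1|B)
        return None
    return [row[n:] for row in R]                     # the block B

print(inverse([[1, 2], [3, 4]]))   # compare with 1/(ad - bc) * (d, -b; -c, a)
print(inverse([[4, 2], [2, 1]]))   # ad - bc = 0, so no inverse exists
```

The case $a = 0$ requires no special treatment here, since the row reduction interchanges rows whenever the current leading entry vanishes.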

2.2.4 Determinants

The determinant is a function $\det : M_{n\times n}(\mathbb{F}) \to \mathbb{F}$, $A \mapsto \det(A) = |A|$, such that

(i) $\det(1) = 1$,

(ii) $\det(AB) = \det(A)\det(B)$,

(iii) $\det(A) \neq 0$ if and only if $A$ is invertible,

(iv) $\det(A)$ is the (oriented) volume of the parallelepiped spanned by the column vectors.

We will now present two approaches, a computational one and a theoretical one (or more precisely, an axiomatic approach). The second approach will shed light on the connection between determinants, volume, and row operations, whereas the first will provide the existence of a determinant as well as a method for calculating determinants in specific cases. However, for general matrices, the transformation into REF is the most efficient way.

Determinants - the computational definition

The general formula for the determinant of general matrices in $M_{n\times n}(\mathbb{F})$ is mainly important from the theoretical point of view, because it involves a lot of calculations (it is a sum of $n!$ summands, each a product of $n$ elements). So we will give a recursive definition. Therefore, we have to introduce the following notion. For $i,j \in \{1,\dots,n\}$, let $A_{ij}$ be the matrix obtained from $A$ by removing the $i$-th row and the $j$-th column. For example, for $A = (a_{ij} : i,j \in \{1,\dots,3\})$,

\[ A_{12} = \begin{pmatrix} a_{21} & a_{23}\\ a_{31} & a_{33} \end{pmatrix}. \]

Definition 2.2.9. For $n = 1$, and $a \in M_{1\times1}(\mathbb{F}) = \mathbb{F}$, let $\det(a) = a$. For arbitrary $n$, we now give the definition in terms of determinants of $(n-1)\times(n-1)$ matrices. Here, we have the following two equivalent definitions for the determinant of $A \in M_{n\times n}(\mathbb{F})$.

(i) For any $i$,

\[ \det(A) := \sum_{j=1}^{n} (-1)^{i+j} a_{ij}\det(A_{ij}). \]

(ii) For any $j$,

\[ \det(A) := \sum_{i=1}^{n} (-1)^{i+j} a_{ij}\det(A_{ij}). \]

Definition (i) is also known as expansion along row $i$, and definition (ii) as expansion along column $j$. Below, using the axiomatic approach to determinants, we will see that with these definitions we always obtain the same value for the determinant. We begin with several examples.

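(i) $n = 2$. Let

\[ A = \begin{pmatrix} a & b\\ c & d \end{pmatrix}. \]

So choose the second version of the definition, and $j = 1$. Then $A_{11} = d$, and $A_{21} = b$. Hence

\[ \det(A) = a\det(A_{11}) - c\det(A_{21}) = ad - bc. \]

(ii) $n = 3$. Let

\[ A = \begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33} \end{pmatrix}. \]

So choose the first version of the definition, and $i = 1$. Then

\[
\begin{aligned}
\det(A) = |A| &= a_{11}\begin{vmatrix} a_{22} & a_{23}\\ a_{32} & a_{33} \end{vmatrix}
- a_{12}\begin{vmatrix} a_{21} & a_{23}\\ a_{31} & a_{33} \end{vmatrix}
+ a_{13}\begin{vmatrix} a_{21} & a_{22}\\ a_{31} & a_{32} \end{vmatrix}\\
&= a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}.
\end{aligned}
\]

There is a mnemonic for this formula (this only applies to $n = 3$). Organize the data as follows:

\[
\begin{array}{ccccc}
a_{13} & a_{11} & a_{12} & a_{13} & a_{11}\\
 & a_{21} & a_{22} & a_{23} & \\
a_{33} & a_{31} & a_{32} & a_{33} & a_{31}
\end{array}
\]

Then the positive summands in the formula are the products of the entries on the diagonals running from top left to bottom right, whereas the products along the diagonals running from bottom left to top right are the negative summands.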

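(iii) The definition of the determinant turns out to be very handy if the matrix has a row or column which contains zero entries. By choosing the appropriate version of the definition (and $i$ or $j$), the calculation simplifies a lot. An important example, where this leads to simple expressions of the determinant, is given by the application to upper triangular matrices. A matrix $A \in M_{n\times n}(\mathbb{F})$ is said to be an upper triangular matrix, if $A$ is of the form

\[
\begin{pmatrix}
\lambda_1 & * & \cdots & *\\
0 & \lambda_2 & * & *\\
 & & \ddots & \\
0 & \cdots & 0 & \lambda_n
\end{pmatrix}.
\]

We will now apply the definition (2nd version, $j = 1$) to show that

\[ \det(A) = \lambda_1\lambda_2\cdots\lambda_n. \]

Proof. This is shown recursively, using the 2nd definition with $j = 1$ (expansion along the first column):

\[
\begin{aligned}
\det(A) &= \begin{vmatrix}
\lambda_1 & * & \cdots & *\\
0 & \lambda_2 & * & *\\
 & & \ddots & \\
0 & \cdots & 0 & \lambda_n
\end{vmatrix}
= \lambda_1\det(A_{11}) - 0\cdot\det(A_{21}) + 0\cdot\det(A_{31}) - \cdots\\
&= \lambda_1\begin{vmatrix}
\lambda_2 & * & \cdots & *\\
0 & \lambda_3 & * & *\\
 & & \ddots & \\
0 & \cdots & 0 & \lambda_n
\end{vmatrix}
= \lambda_1\lambda_2\begin{vmatrix}
\lambda_3 & * & \cdots & *\\
0 & \lambda_4 & * & *\\
 & & \ddots & \\
0 & \cdots & 0 & \lambda_n
\end{vmatrix}
= \cdots = \lambda_1\cdots\lambda_n.
\end{aligned}
\]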
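The recursive definition can be turned into code almost literally. The sketch below is for illustration only: `minor`, `det_row` and `det_col` are assumed names, indices are 0-based (which does not change the sign $(-1)^{i+j}$), and the test matrices are sample data. It is practical only for small $n$, since the recursion produces $n!$ summands.

```python
def minor(A, i, j):
    """The matrix A_ij: remove row i and column j (0-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det_row(A, i=0):
    """Determinant by expansion along row i (version (i) of the definition)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** (i + j) * A[i][j] * det_row(minor(A, i, j))
               for j in range(n))

def det_col(A, j=0):
    """Determinant by expansion along column j (version (ii) of the definition)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** (i + j) * A[i][j] * det_col(minor(A, i, j))
               for i in range(n))

A = [[1, 2, 1], [1, 2, -1], [2, 3, -3]]       # coefficient matrix of Example 17
T = [[2, 5, 7], [0, 3, 1], [0, 0, 4]]         # upper triangular sample matrix

print([det_row(A, i) for i in range(3)])      # same value for every row
print([det_col(A, j) for j in range(3)])      # and for every column
print(det_col(T))                             # product of the diagonal entries
```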

Determinants - the axiomatic approach

We will now give an axiomatic approach to determinants, which will clarify several features of this object. So assume we want to find a function $d$ which determines the (oriented, $n$-dimensional) volume of a parallelepiped spanned by $n$ vectors of $\mathbb{F}^n$. So let $A$ be the matrix whose rows are the given vectors. So we may assume that $d$ is a function from $M_{n\times n}(\mathbb{F})$ to $\mathbb{F}$.

Then, if such a function exists, this function should be linear with respect to each row (this is intuitively clear from the properties of the volume). Moreover, since the cube is the parallelepiped spanned by the standard basis, we should require that $d(1) = 1$. Moreover, we should also have that only objects of full dimension have non-zero volume (what is the volume of a line?). This leads to the following object (of which we do not know at the moment whether it exists or not).

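Let $d : M_{n\times n}(\mathbb{F}) \to \mathbb{F}$ be a function such that

(i) $d(1) = 1$,

(ii) $d$ is linear with respect to each row, that is, for $v,w$ row vectors with entries in $\mathbb{F}$, and $\lambda \in \mathbb{F}$ (here and below, $*$ stands for the remaining rows, which are the same in all matrices of one formula),

\[
\text{(a)}\quad d\begin{pmatrix} *\\ v+w\\ * \end{pmatrix}
= d\begin{pmatrix} *\\ v\\ * \end{pmatrix} + d\begin{pmatrix} *\\ w\\ * \end{pmatrix},
\qquad
\text{(b)}\quad d\begin{pmatrix} *\\ \lambda v\\ * \end{pmatrix}
= \lambda\, d\begin{pmatrix} *\\ v\\ * \end{pmatrix}.
\]

(iii) if two rows are identical, then $d(A) = 0$:

\[
d\begin{pmatrix} *\\ v\\ *\\ v\\ * \end{pmatrix} = 0.
\]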

Using this definition it is now possible to determine the change of $d$ under row operations (as above, $v,w$ are row vectors with entries in $\mathbb{F}$, and $\lambda \in \mathbb{F}$). For each case, $A$ will refer to the original matrix, and $A'$ to the matrix after the row transformation.

Type 1. "Adding the multiple of a row to another".

\[
d\begin{pmatrix} *\\ v\\ *\\ w+\lambda v\\ * \end{pmatrix}
= d\begin{pmatrix} *\\ v\\ *\\ w\\ * \end{pmatrix}
+ \lambda\, d\begin{pmatrix} *\\ v\\ *\\ v\\ * \end{pmatrix}
= d\begin{pmatrix} *\\ v\\ *\\ w\\ * \end{pmatrix}.
\]

Hence adding the multiple of a row to another row does not change $d$ (i.e. $d(A) = d(A')$).

Type 2. "Multiplying a row with $\lambda \neq 0$". As an immediate consequence of the definition, it follows that $\lambda d(A) = d(A')$.

Type 3. "Interchange rows". Applying type 1 operations three times and then property (ii)(b) with $\lambda = -1$, we obtain

\[
d\begin{pmatrix} *\\ v\\ *\\ w\\ * \end{pmatrix}
= d\begin{pmatrix} *\\ v\\ *\\ w-v\\ * \end{pmatrix}
= d\begin{pmatrix} *\\ v+(w-v)\\ *\\ w-v\\ * \end{pmatrix}
= d\begin{pmatrix} *\\ w\\ *\\ w-v-w\\ * \end{pmatrix}
= d\begin{pmatrix} *\\ w\\ *\\ -v\\ * \end{pmatrix}
= -d\begin{pmatrix} *\\ w\\ *\\ v\\ * \end{pmatrix}.
\]

Hence interchanging rows gives $d(A) = -d(A')$.

These simple observations have several far-reaching consequences.

Theorem 2.2.10. The function $d$ is uniquely determined by assumptions (i) to (iii). Since $\det$ satisfies these properties, $d = \det$. Moreover, for $A,B \in M_{n\times n}(\mathbb{F})$,

(i) $\det(AB) = \det(A)\det(B)$,

(ii) $\det(A) = \det(A^T)$.

Proof. By the above considerations, it follows for elementary matrices that $d$ of a type 1 matrix is equal to one, $d$ of a type 3 matrix is equal to minus one, and, for a type 2 matrix,

\[
d\begin{pmatrix}
1 & & & \\
 & \ddots & & \\
 & & \lambda & \\
 & & & 1
\end{pmatrix} = \lambda.
\]

So assume that $A'$ is obtained from $A$ by a row operation which corresponds to a multiplication from the left by an elementary matrix $E$. It hence follows that $d(A') = d(E)d(A)$. Hence, if we transform an invertible matrix $A$ into REF (in this case, the REF is equal to $1$) by means of elementary matrices $E_1,\dots,E_k$, we obtain $1 = d(E_k\cdots E_1A) = d(E_k)\cdots d(E_1)\,d(A)$, that is

\[ d(A) = \frac{1}{d(E_1)\cdots d(E_k)}. \]

If $A$ is not invertible, then the REF has a zero row, and we obtain by the same argument that $d(A) = 0$. Hence, $d$ is uniquely determined, and $d(AB) = d(A)d(B)$ follows from this.

In order to see that $d = \det$, we hence only have to verify that $\det$ satisfies properties (i)-(iii) (this is left as an exercise). In particular, this proves that the two versions of the computational definition coincide, and that $\det(A) = \det(A^T)$.

From the proof, we obtain the following algorithm for the computation of the determinant of a matrix $A$:

Algorithm 3. Do row operations (say, corresponding to elementary matrices $E_1,\dots,E_k$) until $A$ is transformed into a matrix $A'$ such that $\det(A')$ is known (e.g. $A'$ an upper triangular matrix). Then

\[ \det(A) = \frac{\det(A')}{\det(E_1)\cdots\det(E_k)}. \]

In fact, the argument here can be generalized to the multiplication with elementary matrices from the left and the right (say, by elementary matrices $E_1,\dots,E_k$ from the left, and by elementary matrices $E'_1,\dots,E'_l$ from the right). Then, for $A' = E_k\cdots E_1AE'_1\cdots E'_l$, we have

\[ \det(A) = \frac{\det(A')}{\det(E_1)\cdots\det(E_k)\det(E'_1)\cdots\det(E'_l)}. \]

So, since multiplication from the right corresponds to column operations, it is in fact possible to apply row and column operations simultaneously to calculate the determinant.
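A minimal sketch of Algorithm 3, restricted to row operations: the matrix is brought into upper triangular form using only type 1 operations (determinant 1) and type 3 operations (determinant $-1$), so the product $\det(E_1)\cdots\det(E_k)$ is just a sign. The function name `det_by_elimination` is an assumption of this illustration.

```python
from fractions import Fraction

def det_by_elimination(A):
    """det(A) = det(A') / (det(E_1) ... det(E_k)) with A' upper triangular."""
    M = [[Fraction(x) for x in row] for row in A]
    n = len(M)
    factor = Fraction(1)            # running value of det(E_1) ... det(E_k)
    for c in range(n):
        p = next((i for i in range(c, n) if M[i][c] != 0), None)
        if p is None:
            return Fraction(0)      # no pivot in this column: A is not invertible
        if p != c:
            M[c], M[p] = M[p], M[c] # type 3 operation, determinant -1
            factor = -factor
        for i in range(c + 1, n):   # type 1 operations, determinant 1
            f = M[i][c] / M[c][c]
            M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    det_triangular = Fraction(1)
    for c in range(n):
        det_triangular *= M[c][c]   # det(A') is the product of the diagonal
    return det_triangular / factor

print(det_by_elimination([[1, 2, 1], [1, 2, -1], [2, 3, -3]]))
print(det_by_elimination([[4, 2], [2, 1]]))
```

In contrast to the recursive expansion, the number of arithmetic operations here grows only like $n^3$.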

We now summarize properties of invertible matrices.

Proposition 2.2.11. For A ∈Mn×n(F), the following statements are equivalent.

(i) A is invertible.

(ii) $\det(A) \neq 0$.

(iii) $\operatorname{rank}(A) = n$.

(iv) $\ker(A) = \{0\}$.

(v) The row vectors of $A$ are linearly independent.

(vi) The column vectors of $A$ are linearly independent.

Proof. By the dimension formula, (iii) and (iv) are equivalent. By transforming $A$ into REF, we obtain the equivalence of (i), (ii), and (iii). Since $\operatorname{img}(A) = \mathbb{F}^n$ if and only if the column vectors of $A$ are linearly independent, (vi) is equivalent to (iii), and it remains to show that (v) is equivalent to the other statements. This can be deduced from $\det(A^T) = \det(A)$.

2.3 Endomorphisms

An endomorphism of a vector space $V$ is a linear operator $L : V \to V$ (so $V$ is mapped into itself by $L$). The question which will be discussed below is for which endomorphisms there exists a basis such that the associated matrix $A$ is a diagonal matrix.

The first step in the development of the theory is to describe a change of basis in terms of matrices (this applies to general finite dimensional vector spaces).

2.3.1 Change of base

So assume that $V$ is a finite dimensional vector space, that $\{v_1,\dots,v_n\}$ and $\{v'_1,\dots,v'_n\}$ are two bases of $V$, and that

\[ \sum_{i=1}^{n} \lambda_i v_i = \sum_{i=1}^{n} \mu_i v'_i \in V. \]

We now want to determine a matrix which maps $(\mu_1,\dots,\mu_n)^T$ to $(\lambda_1,\dots,\lambda_n)^T$ and vice versa. Therefore we will use the following trick. As a generalized notion of the matrix product, we define, for $A = (a_{ij}) \in M_{n\times k}(\mathbb{F})$,

\[ (v_1,\dots,v_n)A := \left(\sum_{j=1}^{n} a_{j1}v_j,\ \dots,\ \sum_{j=1}^{n} a_{jk}v_j\right). \]

Note that here, the $v_i$ are elements of $V$, and not of $\mathbb{F}$. Now choose $p_{ij}$ ($1 \le i,j \le n$) such that

\[ \sum_{i=1}^{n} p_{ij}v'_i = v_j \]

for all $j = 1,\dots,n$. With the notation above, and $P = (p_{ij})$, this reads as

\[ (v'_1,\dots,v'_n)P = (v_1,\dots,v_n). \]

By symmetry, it follows that there exists $P' \in M_{n\times n}(\mathbb{F})$ such that

\[ (v_1,\dots,v_n)P' = (v'_1,\dots,v'_n). \]

Combining these two equations then gives

\[ (v_1,\dots,v_n)P'P = (v_1,\dots,v_n). \]

Since $\{v_1,\dots,v_n\}$ is a basis, it follows that $P'P = 1$. Hence $P$ is invertible, and $P^{-1} = P'$. We organized things in such a way that we are now able to multiply the above equations with $(\lambda_1,\dots,\lambda_n)^T$ from the right. This gives

\[
\sum_{i=1}^{n}\lambda_i v_i = (v_1,\dots,v_n)\begin{pmatrix}\lambda_1\\ \vdots\\ \lambda_n\end{pmatrix}
= (v'_1,\dots,v'_n)P\begin{pmatrix}\lambda_1\\ \vdots\\ \lambda_n\end{pmatrix}.
\]

Hence the above problem is solved by $(\mu_1,\dots,\mu_n)^T = P(\lambda_1,\dots,\lambda_n)^T$, and by $(\lambda_1,\dots,\lambda_n)^T = P^{-1}(\mu_1,\dots,\mu_n)^T$, respectively.

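Example 22. Recall that in Example 13, we have shown that the operator $L$ given by the matrix

\[ A = \begin{pmatrix} 4 & 2\\ 2 & 1 \end{pmatrix} \]

maps $(2,1)^T$ to $(10,5)^T$, and $(-1,2)^T$ to $0$. We now first want to determine the matrix of the change of basis from the standard basis $\{e_1,e_2\}$ to $\{(2,1)^T,(-1,2)^T\}$, and then use this change of basis to obtain the matrix of $L$ associated with the new basis.

Change of basis. The values $p'_{ij}$ are determined by

\[
\begin{pmatrix}2\\1\end{pmatrix} = p'_{11}e_1 + p'_{21}e_2 = 2e_1 + 1e_2,
\qquad
\begin{pmatrix}-1\\2\end{pmatrix} = p'_{12}e_1 + p'_{22}e_2 = -1e_1 + 2e_2.
\]

We hence obtain that

\[
P' = \begin{pmatrix} 2 & -1\\ 1 & 2 \end{pmatrix},
\qquad
P = (P')^{-1} = \frac15\begin{pmatrix} 2 & 1\\ -1 & 2 \end{pmatrix}.
\]

Matrix of $L$ associated with the new basis. The matrix associated to the linear operator with respect to the new basis can be determined with the following two methods.

(i) Since $L((2,1)^T) = (10,5)^T = 5(2,1)^T$, and $L((-1,2)^T) = 0$, we conclude immediately that the matrix with respect to the new basis is given by

\[ B = \begin{pmatrix} 5 & 0\\ 0 & 0 \end{pmatrix}. \]

Note that this method only applies here since we know the representation of $L((2,1)^T)$ with respect to the new basis.

(ii) For the general method, the associated operators are organized as follows. The upper row contains the coordinates with respect to $\{v_i\}$, the lower row the coordinates with respect to $\{v'_i\}$, and both rows represent $L : V \to V$.

\[
\begin{array}{ccc}
\mathbb{F}^n & \xrightarrow{\ A\ } & \mathbb{F}^n\\[2pt]
{\scriptstyle P}\big\downarrow & & \big\uparrow{\scriptstyle P^{-1}}\\[2pt]
\mathbb{F}^n & \xrightarrow{\ B\ } & \mathbb{F}^n
\end{array}
\]

Hence $A = P^{-1}BP$, or equivalently $B = PAP^{-1}$. In the example, we obtain

\[
B = PAP^{-1} = \frac15\begin{pmatrix} 2 & 1\\ -1 & 2 \end{pmatrix}
\begin{pmatrix} 4 & 2\\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 2 & -1\\ 1 & 2 \end{pmatrix}
= \begin{pmatrix} 5 & 0\\ 0 & 0 \end{pmatrix}.
\]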
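The computation $B = PAP^{-1}$ of Example 22 can be verified with exact arithmetic. In this sketch the helper names `matmul` and `inverse2` are assumptions; `inverse2` uses the formula for the inverse of a $2\times2$-matrix with integer entries, and the columns of `P_prime` are the new basis vectors written in the standard basis.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(A[i][j] * B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))] for i in range(len(A))]

def inverse2(M):
    """Inverse of an integer 2x2 matrix via 1/(ad - bc) * (d, -b; -c, a)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

A = [[4, 2], [2, 1]]            # matrix of L in the standard basis
P_prime = [[2, -1], [1, 2]]     # columns: the new basis vectors (2,1) and (-1,2)
P = inverse2(P_prime)           # P = (P')^{-1}

B = matmul(matmul(P, A), P_prime)   # B = P A P^{-1}, since P^{-1} = P'
print(P)
print(B)
```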


2.3.2 Eigenvectors and eigenvalues

In the above example, we were able to simplify the matrix by choosing an appropriate basis. In general, this method relies on the existence of eigenvalues and eigenvectors.

Definition 2.3.1. Let $L : V \to V$ be an endomorphism. Then $\lambda \in \mathbb{F}$ is called an eigenvalue of $L$ with eigenvector $v \in V$, if $v \neq 0$, and

\[ L(v) = \lambda v. \]

If there exists a basis of $V$ consisting of eigenvectors, then $L$ is called diagonalizable.

Note that in contrast to an eigenvector, an eigenvalue might be equal to zero. In Example 22, $(2,1)^T$ is an eigenvector with respect to the eigenvalue 5, and $(-1,2)^T$ an eigenvector with respect to the eigenvalue 0. The name 'diagonalizable' in the last definition stems from the following simple fact: if $L$ is diagonalizable, then the matrix $A$ with respect to the basis of eigenvectors is given by

\[
A = \begin{pmatrix}
\lambda_1 & 0 & \cdots & 0\\
0 & \lambda_2 & & \vdots\\
\vdots & & \ddots & 0\\
0 & \cdots & 0 & \lambda_n
\end{pmatrix},
\]

where $\lambda_1,\dots,\lambda_n$ refer to the corresponding eigenvalues. In order to determine eigenvectors and eigenvalues of a given endomorphism $L$, one makes use of the following observations (here, $\operatorname{id} : V \to V$, $v \mapsto v$ denotes the identity).

(i) Assume that $\lambda$ is an eigenvalue of $L$. Then $L - \lambda\operatorname{id}$ is not invertible.

(ii) So assume that $L - \lambda\operatorname{id}$ is not invertible. Then there exists $v \in V$, $v \neq 0$ such that $(L - \lambda\operatorname{id})(v) = L(v) - \lambda v = 0$. Hence $\lambda$ is an eigenvalue.

Now fix a basis of $V$, and let $A$ be the associated matrix to $L$. As a consequence of Proposition 2.2.11, we obtain the following proposition.

Proposition 2.3.2. $\lambda$ is an eigenvalue of $A$ if and only if $\det(A - \lambda 1) = 0$.

Moreover, it can be shown (see footnote 3) that $\det(A - t1)$ is a polynomial of degree $\dim(V)$. This polynomial is called the characteristic polynomial of $A$, and is denoted by

\[ \chi_A(t) = \chi(t) := \det(A - t1). \]

We have shown that the following holds.

Proposition 2.3.3. The element $\lambda \in \mathbb{F}$ is an eigenvalue of $A$ if and only if $\lambda$ is a root of the characteristic polynomial (i.e. $\chi_A(\lambda) = 0$).

Footnote 3. Using the linearity in the rows, we obtain that $\det(A - t1) = \det(A) + \cdots + (-1)^n t^n$.

So assume that $A \in M_{n\times n}(\mathbb{F})$. The fundamental theorem of algebra (see footnote 4) tells us that there exist $\lambda_1,\dots,\lambda_k \in \mathbb{C}$ with $\lambda_i \neq \lambda_j$ for $i \neq j$, and $l_1,l_2,\dots,l_k \in \mathbb{N}$ with $l_1 + l_2 + \cdots + l_k = n$ such that

\[ \chi_A(t) = (-1)^n\prod_{i=1}^{k}(t - \lambda_i)^{l_i}, \]

where the factor $(-1)^n$ is the leading coefficient of $\det(A - t1)$. Here, we will refer to $l_i$ as the algebraic multiplicity of the eigenvalue $\lambda_i$. Hence, for each $A \in M_{n\times n}(\mathbb{F})$, there exist $n$ eigenvalues (counted with algebraic multiplicity). However, note that the eigenvalues might be complex, even if the matrix is an element of $M_{n\times n}(\mathbb{R})$. In order to determine the associated eigenvectors, and to decide whether a given matrix is diagonalizable, we have to look at the following objects.

Definition 2.3.4. Let $\lambda \in \mathbb{C}$ be an eigenvalue of the endomorphism $L : V \to V$. Then

\[ E_\lambda := \{v \in V : L(v) = \lambda v\} \]

is called the eigenspace of $\lambda$, and $\dim(E_\lambda)$ is called the geometric multiplicity of $\lambda$.

Note that $E_\lambda$ and $\dim(E_\lambda)$ can be determined as follows. Let $A \in M_{n\times n}(\mathbb{F})$ be the associated matrix to $L$ for some given basis. Then

\[ E_\lambda = \{v \in \mathbb{F}^n : Av = \lambda v\} = \{v \in \mathbb{F}^n : Av - \lambda v = 0\} = \ker(A - \lambda 1), \]

and hence by the dimension formula, we obtain that $\dim(E_\lambda) = n - \operatorname{rank}(A - \lambda 1)$. But these are linear equations, and we already know how to solve them! The following proposition now answers the question whether a matrix is diagonalizable or not.

Proposition 2.3.5.

(i) A matrix $A \in M_{n\times n}(\mathbb{C})$ is diagonalizable (w.r.t. $\mathbb{C}$) if and only if the geometric multiplicities are equal to the algebraic multiplicities.

(ii) A matrix $A \in M_{n\times n}(\mathbb{R})$ is diagonalizable (w.r.t. $\mathbb{R}$) if and only if the geometric multiplicities are equal to the algebraic multiplicities, and there exist $\lambda_1,\dots,\lambda_k \in \mathbb{R}$ with $\lambda_i \neq \lambda_j$ for $i \neq j$, and $l_1,l_2,\dots,l_k \in \mathbb{N}$ with $l_1 + l_2 + \cdots + l_k = n$ such that

\[ \chi_A(t) = (-1)^n\prod_{i=1}^{k}(t - \lambda_i)^{l_i}. \tag{2.3} \]

Footnote 4. Recall that the fundamental theorem of algebra states that for each polynomial

\[ p(z) = a_nz^n + a_{n-1}z^{n-1} + \cdots + a_1z + a_0 \]

with $a_0,\dots,a_n \in \mathbb{C}$, $a_n \neq 0$, there exist $z_1,\dots,z_k \in \mathbb{C}$ with $z_i \neq z_j$ for $i \neq j$, and $l_1,l_2,\dots,l_k \in \mathbb{N}$ with $l_1 + l_2 + \cdots + l_k = n$ such that

\[ p(z) = a_n\prod_{i=1}^{k}(z - z_i)^{l_i}. \]

In particular, $z_1,\dots,z_k \in \mathbb{C}$ are the roots of $p(z)$, that is $p(z) = 0$ if and only if $z = z_i$ for some $i = 1,\dots,k$.

For the proof of Proposition 2.3.5, we will make use of the following proposition.

Proposition 2.3.6. If $\lambda_1,\dots,\lambda_k$ are pairwise distinct eigenvalues, $v_1 \in E_{\lambda_1},\dots,v_k \in E_{\lambda_k}$, and $v_1 \neq 0,\dots,v_k \neq 0$, then $v_1,\dots,v_k$ are linearly independent.

Proof.

Proof of Proposition 2.3.5. So choose a basis for each $E_{\lambda_i}$. Then, by Proposition 2.3.6, the union of the bases of $E_{\lambda_1},\dots,E_{\lambda_k}$ is linearly independent. Since the geometric multiplicities are equal to the algebraic multiplicities, it follows that this union consists of $n$ vectors, and hence the union is a basis of $\mathbb{F}^n$.

Summarizing the above results, we obtain the following algorithm in order to decide whethera given matrix A ∈ Mn×n(F) is diagonalizable, to determine the eigenvalues and the corre-sponding basis such that after a change of basis, the matrix is in diagonal form.

Algorithm 4 (Decide whether A is diagonalizable).

Step 1. Calculate $\det(A - t\mathbf{1})$.

Step 2. Determine all roots of $\chi_A(t) = \det(A - t\mathbf{1})$. These roots are the eigenvalues of $A$, say $\lambda_1, \dots, \lambda_k \in F$.

Step 3a. For $F = \mathbb{R}$: Decide whether $\chi_A(t)$ is of form (2.3), for $\lambda_1, \dots, \lambda_k \in \mathbb{R}$, and $l_1, l_2, \dots, l_k \in \mathbb{N}$.

Step 3b. For $F = \mathbb{C}$: Determine $l_1, l_2, \dots, l_k \in \mathbb{N}$ such that $\chi_A(t)$ is of form (2.3).

Step 4. Determine the geometric multiplicity of each eigenvalue $\lambda$. If the algebraic multiplicity of $\lambda$ is equal to one, it follows that the geometric multiplicity is also equal to one. If the algebraic multiplicity of $\lambda$ is bigger than one, then the geometric multiplicity can be determined using one of the following methods:

(i) Determine $\operatorname{rank}(A - \lambda\mathbf{1})$ using row operations. By the dimension formula, it follows that the geometric multiplicity of $\lambda$ is equal to $n - \operatorname{rank}(A - \lambda\mathbf{1})$.

(ii) Determine a basis of $\ker(A - \lambda\mathbf{1})$ using the REF. The geometric multiplicity of $\lambda$ is then given by the number of elements of this basis. This method is in particular useful if one needs to determine the eigenspace of $\lambda$.

Step 5a. For $F = \mathbb{R}$: If the geometric multiplicities are equal to the algebraic multiplicities, and $\chi_A(t)$ is of form (2.3), then $A$ is diagonalizable.

Step 5b. For $F = \mathbb{C}$: If the geometric multiplicities are equal to the algebraic multiplicities, then $A$ is diagonalizable.


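Step 6. If $A$ is diagonalizable, then the diagonal form of $A$ is

\[
\begin{pmatrix}
\lambda_1 & & & & & \\
 & \ddots & & & & \\
 & & \lambda_1 & & & \\
 & & & \lambda_2 & & \\
 & & & & \ddots & \\
 & & & & & \lambda_k
\end{pmatrix},
\]

where all entries off the diagonal are zero and each $\lambda_i$ occurs precisely $l_i$ times.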

Note that in order to obtain the diagonal form of $A$, we do not have to determine a basis of the eigenspaces; we only have to know their dimensions. In order to determine the basis of eigenvectors of a diagonalizable matrix $A$, the above algorithm has to be modified as follows.

Algorithm 5 (Computing the basis of eigenvectors for A diagonalizable).

Step 1 - 3. As above.

Step 4. For each eigenvalue $\lambda$, determine a basis of $\ker(A - \lambda\mathbf{1})$ using the REF. The union of the elements of the bases for each eigenvalue of $A$ is then a basis of $F^n$ (this is due to the fact that $A$ is assumed to be diagonalizable).

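Example 23. Let $A$ be the matrix given by a rotation by 90° counterclockwise, that is

\[ A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}. \]

Then

\[ \chi_A(t) = \det\left( \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} - \begin{pmatrix} t & 0 \\ 0 & t \end{pmatrix} \right) = \det \begin{pmatrix} -t & -1 \\ 1 & -t \end{pmatrix} = t^2 + 1. \]

Hence the eigenvalues of $A$ are $i$ and $-i$, and $\chi_A(t) = (t - i)(t + i)$ (these are Steps 1-3 of the above algorithm, for $F = \mathbb{C}$). The eigenspaces are determined using the REF:

\[ \begin{pmatrix} -i & -1 \\ 1 & -i \end{pmatrix} \to \begin{pmatrix} -i & -1 \\ 0 & 0 \end{pmatrix} \to \begin{pmatrix} 1 & -i \\ 0 & 0 \end{pmatrix}. \]

Hence $E_i = \operatorname{span}((i,1)^T)$. For the eigenvalue $-i$, we obtain

\[ \begin{pmatrix} i & -1 \\ 1 & i \end{pmatrix} \to \begin{pmatrix} i & -1 \\ 0 & 0 \end{pmatrix} \to \begin{pmatrix} 1 & i \\ 0 & 0 \end{pmatrix}. \]

Hence $E_{-i} = \operatorname{span}((-i,1)^T)$. In particular, it follows that the algebraic multiplicities and the geometric multiplicities are equal to one (Steps 4 and 5). Hence, the matrix $A$ is diagonalizable as an element of $M_{2\times 2}(\mathbb{C})$, but not as an element of $M_{2\times 2}(\mathbb{R})$, since $\chi_A(t) = t^2 + 1$ has no real roots.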


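Example 24. Let $A$ be the matrix given by

\[ A = \begin{pmatrix} \lambda & 1 & 0 \\ 0 & \lambda & 1 \\ 0 & 0 & \lambda \end{pmatrix}. \]

Then $\chi_A(t) = (\lambda - t)^3$.

Hence, $\lambda$ is an eigenvalue of $A$ with algebraic multiplicity equal to 3. Here, $A - \lambda\mathbf{1}$ is already in REF:

\[ A - \lambda\mathbf{1} = \begin{pmatrix} \lambda & 1 & 0 \\ 0 & \lambda & 1 \\ 0 & 0 & \lambda \end{pmatrix} - \begin{pmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & \lambda \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}. \]

Hence $E_\lambda = \operatorname{span}((1,0,0)^T)$, so the geometric multiplicity is equal to one, and hence $A$ is not diagonalizable.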
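Method (i) of Step 4 in Algorithm 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `rank` and `geometric_multiplicity` are introduced here and are not part of the notes, exact rational arithmetic is used so that the rank is not affected by rounding, and the value $\lambda = 3$ is an arbitrary choice for the matrix of Example 24.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix via Gaussian elimination with exact rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    r = 0  # index of the next pivot row
    for c in range(n_cols):
        # look for a row at or below r with a non-zero entry in column c
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def geometric_multiplicity(matrix, lam):
    """n - rank(A - lam * 1), by the dimension formula."""
    n = len(matrix)
    shifted = [[matrix[i][j] - (lam if i == j else 0) for j in range(n)]
               for i in range(n)]
    return n - rank(shifted)


lam = 3  # assumed value, chosen only for illustration
A = [[lam, 1, 0], [0, lam, 1], [0, 0, lam]]
geo = geometric_multiplicity(A, lam)
print(geo)
```

For the matrix of Example 24 the geometric multiplicity is smaller than the algebraic multiplicity 3, which is why the matrix is not diagonalizable.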

2.3.3 Spectral theorem for symmetric matrices

For two classes of matrices, it is a priori known that a matrix is diagonalizable. These classes are defined as follows.

Definition 2.3.7. For $A \in M_{n\times n}(\mathbb{C})$, the Hermitian conjugate of $A$ is defined by

\[ A^* := \overline{A}^T. \]

Here, $\overline{A}$ refers to the matrix whose entries are the complex conjugates of the entries of $A$. If $A = A^T$, then $A$ is called symmetric, and if $A = A^*$, then one refers to $A$ as a Hermitian matrix.

Note that, for $A \in M_{n\times n}(\mathbb{R})$, we have $A^* = A^T$. Hence a symmetric matrix with real entries is also Hermitian.

Theorem 2.3.8 (Spectral theorem for Hermitian/symmetric matrices). Assume that either

(i) A ∈Mn×n(C), and A is Hermitian, or

(ii) A ∈Mn×n(R), and A is symmetric.

Then any eigenvalue is an element of $\mathbb{R}$, and $A$ is diagonalizable with respect to an orthonormal basis.

Proof (ESM2B). We start with giving the proof for a Hermitian matrix $A$. By the fundamental theorem of algebra, there exists at least one eigenvalue $\lambda$ of $A$.

So assume that $v \in \mathbb{C}^n$, $v \neq 0$, is an eigenvector of $\lambda$. Without loss of generality, we may assume that $(v,v) = 1$ (by replacing $v$ by $\frac{1}{\|v\|}v$). Hence, by the characterization of $A$ as self-adjoint operator (see Section 2.3.3 below),

\[ \overline{\lambda} = \overline{\lambda}(v,v) = (\lambda v, v) = (Av, v) = (v, Av) = (v, \lambda v) = \lambda (v,v) = \lambda. \]

Hence, $\overline{\lambda} = \lambda$, and in particular, $\lambda \in \mathbb{R}$. Using the Gram-Schmidt process (see Section 2.3.3 below), there exists an orthonormal basis $v, w_1, \dots, w_{n-1}$ of $\mathbb{C}^n$. Note that, for $k = 1, \dots, n-1$,

\[ (Aw_k, Av) = (Aw_k, \lambda v) = \lambda (Aw_k, v) = \lambda (w_k, Av) = \lambda^2 (w_k, v). \]

Since $v, w_1, \dots, w_{n-1}$ is orthonormal, it follows that $(Aw_k, Av) = 0$. Hence

\[ A(\operatorname{span}(v)) \subset \operatorname{span}(v), \quad \text{and} \quad A(\operatorname{span}(w_1, \dots, w_{n-1})) \subset \operatorname{span}(w_1, \dots, w_{n-1}). \]

In particular, it follows from this that with respect to this basis, $A$ is transformed to a matrix

\[
\begin{pmatrix}
\lambda & 0 & \cdots & 0 \\
0 & * & \cdots & * \\
\vdots & \vdots & & \vdots \\
0 & * & \cdots & *
\end{pmatrix}
=
\begin{pmatrix}
\lambda & 0 \\
0 & A_{n-1}
\end{pmatrix},
\]

where $A_{n-1}$ is a matrix of dimension $(n-1) \times (n-1)$. Since $A$ is a self-adjoint operator, it follows that $A_{n-1}$ is a self-adjoint operator of $\mathbb{C}^{n-1}$, and in particular is Hermitian. The assertion then follows by induction.

For symmetric matrices over $\mathbb{R}$, the statement can be deduced from the result for Hermitian matrices as follows. So assume that $A \in M_{n\times n}(\mathbb{R})$ is symmetric. Hence, $A$ is also Hermitian as an element of $M_{n\times n}(\mathbb{C})$. By the above, there exists a real eigenvalue $\lambda$. The assertion then follows by the same arguments as above, where $A$ is considered again as an element of $M_{n\times n}(\mathbb{R})$.

Self-adjoint operators (ESM2B)

There is also an abstract approach to these classes of matrices in terms of an endomorphism $L : V \to V$, and an inner product $(\cdot,\cdot)$ defined on $V$. The operator is called self-adjoint if $(v, L(w)) = (L(v), w)$ for all $v, w \in V$. Recall that, by choosing an orthonormal basis,$^5$ the inner product $(\cdot,\cdot)$ on $V$ is given by the standard scalar product on $F^n$. In particular, if $A$ is the matrix associated to $L$ via the choice of the orthonormal basis, we obtain, for $v, w \in F^n$,

\[ (v, Aw) = v^* A w = (A^* v)^* w = (A^* v, w). \]

Since entry $(i,j)$ of $A$ is equal to $(e_i, Ae_j)$, it follows that,

(i) for $F = \mathbb{R}$, the endomorphism $L$ is self-adjoint if and only if the associated matrix is symmetric,

(ii) for $F = \mathbb{C}$, the endomorphism $L$ is self-adjoint if and only if the associated matrix is Hermitian.

So we obtain the following equivalent formulation of the above theorem.

Theorem 2.3.9 (Spectral theorem for self-adjoint operators). Assume that $L : V \to V$ is a self-adjoint endomorphism of the finite dimensional vector space $V$. Then there exists an orthonormal basis such that the associated matrix is a diagonal matrix with entries in $\mathbb{R}$.

5 The existence of an orthonormal basis follows from the Gram–Schmidt process below.


Orthonormalization (ESM2B)

There is an explicit method to construct an orthonormal basis from a given set of linearly independent vectors. The corresponding algorithm, the so-called Gram–Schmidt process, is defined as follows.

Assume that $v_1, \dots, v_n$ is a set of linearly independent vectors in $V$, and $(\cdot,\cdot)$ is an inner product on $V$. Set

\[
\begin{aligned}
e_1 &:= \frac{1}{\|v_1\|} v_1 = \frac{1}{\sqrt{(v_1,v_1)}} v_1, \\
b_2 &:= v_2 - (e_1, v_2) \cdot e_1, \\
e_2 &:= \frac{1}{\|b_2\|} b_2, \\
&\ \ \vdots \\
b_{k+1} &:= v_{k+1} - \sum_{i=1}^{k} (e_i, v_{k+1}) \cdot e_i, \\
e_{k+1} &:= \frac{1}{\|b_{k+1}\|} b_{k+1}, \\
&\ \ \vdots
\end{aligned}
\]

For these vectors, the following holds.

Proposition 2.3.10. For each k = 1, . . . ,n, the set e1, . . . ,ek is an orthonormal system, and

span(e1, . . . ,ek) = span(v1, . . . ,vk).

In particular, if $v_1, \dots, v_n$ is a basis of $V$, then $e_1, \dots, e_n$ is an orthonormal basis.

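Example 25. Let

\[ v_1 := \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix}, \quad v_2 := \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix}, \quad v_3 := \begin{pmatrix} 1 \\ -1 \\ -3 \end{pmatrix}. \]

By the above algorithm, we obtain

\[
\begin{aligned}
e_1 &:= \frac{v_1}{\|v_1\|} = \frac{1}{\sqrt{6}} \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix}, \\
b_2 &:= v_2 - (e_1, v_2) \cdot e_1 = \begin{pmatrix} 2 \\ 2 \\ 3 \end{pmatrix} - \frac{10}{\sqrt{6}} \cdot \frac{1}{\sqrt{6}} \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix} = \frac{1}{3} \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix}, \\
e_2 &:= \frac{1}{\sqrt{3}} \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix}, \\
b_3 &:= v_3 - (e_1, v_3) \cdot e_1 - (e_2, v_3) \cdot e_2 \\
&= \begin{pmatrix} 1 \\ -1 \\ -3 \end{pmatrix} - \frac{-6}{\sqrt{6}} \cdot \frac{1}{\sqrt{6}} \begin{pmatrix} 1 \\ 1 \\ 2 \end{pmatrix} - \frac{3}{\sqrt{3}} \cdot \frac{1}{\sqrt{3}} \begin{pmatrix} 1 \\ 1 \\ -1 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}, \\
e_3 &:= \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}.
\end{aligned}
\]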
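The Gram–Schmidt process can be sketched in Python as follows. This is a minimal illustration, assuming real vectors with the standard scalar product and floating point arithmetic; the function names `inner` and `gram_schmidt` are introduced here for the example only. It is applied to the three vectors of Example 25.

```python
import math


def inner(u, v):
    """Standard scalar product on R^n (real vectors are assumed)."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors, in the given order."""
    basis = []
    for v in vectors:
        b = list(v)
        # subtract the components of v along the vectors found so far
        for e in basis:
            coeff = inner(e, v)
            b = [bi - coeff * ei for bi, ei in zip(b, e)]
        norm = math.sqrt(inner(b, b))
        basis.append([bi / norm for bi in b])
    return basis


e1, e2, e3 = gram_schmidt([(1, 1, 2), (2, 2, 3), (1, -1, -3)])
print(e1, e2, e3)
```

Up to rounding, the result agrees with the vectors $\frac{1}{\sqrt{6}}(1,1,2)^T$, $\frac{1}{\sqrt{3}}(1,1,-1)^T$ and $\frac{1}{\sqrt{2}}(1,-1,0)^T$ computed by hand. The sketch does not check linear independence; a dependent input would lead to a division by zero.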

2.3.4 Linear groups (ESM2B)

Here, we give a brief account of the following subsets of $M_{n\times n}(F)$. The set

\[ GL(n,F) := \{A \in M_{n\times n}(F) : \det(A) \neq 0\} = \{A \in M_{n\times n}(F) : A \text{ invertible}\} \]

is called the general linear group. Furthermore, a matrix $A \in M_{n\times n}(\mathbb{R})$ is called orthogonal, if

\[ (Av, Aw) = (v, w) \]

for all $v, w \in \mathbb{R}^n$, where $(v,w)$ refers to the standard scalar product. A matrix $A \in M_{n\times n}(\mathbb{C})$ is called unitary, if

\[ (Av, Aw) = (v, w) \]

for all $v, w \in \mathbb{C}^n$, where $(v,w)$ refers to the standard scalar product. The orthogonal group and the unitary group are defined by

\[ O(n) := \{A \in M_{n\times n}(\mathbb{R}) : A \text{ orthogonal}\}, \qquad U(n) := \{A \in M_{n\times n}(\mathbb{C}) : A \text{ unitary}\}. \]

The unitary group

So assume that $A \in U(n)$. Then, by definition, for $i, j = 1, \dots, n$,

\[ e_i^* A^* A e_j = (Ae_i, Ae_j) = (e_i, e_j) = \begin{cases} 1 & i = j, \\ 0 & i \neq j. \end{cases} \]

Since entry $(i,j)$ of $A^*A$ is equal to $(Ae_i, Ae_j)$, it follows that $A^*A = \mathbf{1}$. Hence $A$ is unitary if and only if $A^{-1} = A^*$. Moreover, the Hermitian conjugate $A^*$ of a unitary matrix $A$ is also unitary.

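There is a further characterization of $U(n)$ in terms of orthonormality of the column and row vectors. So let $v_1, \dots, v_n$ be an orthonormal basis of $\mathbb{C}^n$, and $A$ be the matrix with column vectors $v_1, \dots, v_n$, so that $Ae_i = v_i$ (in fact, this matrix might be seen as the change of basis $P'$ from $v_1, \dots, v_n$ to the standard basis, cf. Section 2.3.1). For $u = \sum \lambda_i e_i$, $w = \sum \mu_j e_j$, using

\[ (e_i, e_j) = (v_i, v_j) = \begin{cases} 1 & i = j, \\ 0 & i \neq j, \end{cases} \]

it then follows that

\[
\begin{aligned}
(u, w) &= \Big( \sum \lambda_i e_i, \sum \mu_j e_j \Big) = \sum_{i=1}^{n} \overline{\lambda_i} \mu_i (e_i, e_i) + \sum_{i \neq j} \overline{\lambda_i} \mu_j (e_i, e_j) = \sum_{i=1}^{n} \overline{\lambda_i} \mu_i, \\
(Au, Aw) &= \Big( \sum \lambda_i v_i, \sum \mu_j v_j \Big) = \sum_{i=1}^{n} \overline{\lambda_i} \mu_i.
\end{aligned}
\]

Hence, $A$ is a unitary matrix, and we have shown that a matrix whose column vectors form an orthonormal basis of $\mathbb{C}^n$ is unitary. Conversely, assume that $A$ is a unitary matrix. Then $Ae_i$ is the $i$-th column of $A$. Since $A$ is invertible, it follows that the column vectors of $A$ form a basis of $\mathbb{C}^n$. Moreover, since $A$ is unitary,

\[ (Ae_i, Ae_j) = (e_i, e_j) = \begin{cases} 1 & i = j, \\ 0 & i \neq j. \end{cases} \]

Hence the column vectors form an orthonormal basis of $\mathbb{C}^n$. Since $A^*$ is unitary if and only if $A$ is unitary, and the columns of $A^*$ are the complex conjugates of the rows of $A$, the same characterization holds for the row vectors. We have shown the following theorem.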

Theorem 2.3.11. The following are equivalent, for A ∈Mn×n(C)

(i) (Av,Aw) = (v,w) for all v,w ∈ Cn.

(ii) A−1 = A∗.

(iii) The row vectors of A form an orthonormal basis of Cn.

(iv) The column vectors of A form an orthonormal basis of Cn.

If one of these statements is true, then A is a unitary matrix
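Condition (ii) gives a simple numerical test for unitarity. The following Python sketch checks $A^*A = \mathbf{1}$ up to a tolerance; the helper names and the sample matrix $U$ are assumptions made for this illustration and do not appear in the notes.

```python
import math


def conjugate_transpose(A):
    """Hermitian conjugate A* of a matrix given as a list of rows."""
    n, m = len(A), len(A[0])
    return [[A[i][j].conjugate() for i in range(n)] for j in range(m)]


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def is_unitary(A, tol=1e-12):
    """Check A* A = 1 entrywise, up to the tolerance tol."""
    n = len(A)
    product = matmul(conjugate_transpose(A), A)
    return all(abs(product[i][j] - (1 if i == j else 0)) <= tol
               for i in range(n) for j in range(n))


s = 1 / math.sqrt(2)
U = [[s, s * 1j], [s * 1j, s]]  # sample matrix, chosen for illustration
rotation = [[0, -1], [1, 0]]    # the rotation matrix of Example 23
shear = [[1, 1], [0, 1]]        # invertible, but not unitary

print(is_unitary(U), is_unitary(rotation), is_unitary(shear))
```

Since the entries of the rotation matrix are real, the same test also confirms that it is orthogonal.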

The orthogonal group

By the same arguments as above, we obtain the corresponding theorem for $O(n)$.

Theorem 2.3.12. The following are equivalent, for $A \in M_{n\times n}(\mathbb{R})$.

(i) $(Av, Aw) = (v, w)$ for all $v, w \in \mathbb{R}^n$.

(ii) $A^{-1} = A^T$.

(iii) The row vectors of $A$ form an orthonormal basis of $\mathbb{R}^n$.

(iv) The column vectors of $A$ form an orthonormal basis of $\mathbb{R}^n$.

If one of these statements is true, then $A$ is an orthogonal matrix.

Remark (Abstract definition of a group). Here, the name group stems from the following properties. So let $G$ be either $GL(n,F)$, $O(n)$ or $U(n)$. It then follows from the above and Sections 2.2.2 and 2.2.4 that

(i) $(AB)C = A(BC)$ for all $A, B, C \in G$,

(ii) $AB \in G$ for all $A, B \in G$,

(iii) $\mathbf{1} \in G$, and $\mathbf{1}A = A = A\mathbf{1}$ for all $A \in G$,

(iv) $A^{-1} \in G$, and $AA^{-1} = A^{-1}A = \mathbf{1}$ for all $A \in G$.

If for a set $G$, there exists a map $G \times G \to G$, $(A,B) \mapsto AB$ such that properties (i) - (iv) are fulfilled, then $G$ is called a group. Well-known examples for groups are vector spaces with respect to vector addition, or the set $\mathbb{R}_{>0} = \{x \in \mathbb{R} : x > 0\}$ with respect to multiplication.

2.3.5 Spectral theorem for normal matrices (ESM2B)

Definition 2.3.13. A matrix $A \in M_{n\times n}(\mathbb{C})$ is called normal if $A^*A = AA^*$.

Since $A^{-1} = A^*$ for unitary and orthogonal matrices, it follows that unitary and orthogonal matrices are normal. For normal matrices, the following spectral theorem holds.

Theorem 2.3.14 (Spectral theorem for normal matrices). Assume that $A$ is a normal matrix. Then $A$ is diagonalizable with respect to an orthonormal basis.

Proof. The method of proof is similar to the proof of the spectral theorem for Hermitian operators.

Remarks.

(i) Using the characterization of $U(n)$, we obtain the following equivalent statement of the above theorem. If $A \in M_{n\times n}(\mathbb{C})$ is normal, then there exists $P \in U(n)$ such that $PAP^{-1}$ is a diagonal matrix.

(ii) If one applies the above theorem to a normal matrix $A \in M_{n\times n}(\mathbb{R})$ with real coefficients, it follows that $A$ is diagonalizable over $\mathbb{C}$. This is illustrated in Example 23.


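2.3.6 The Jordan normal form (ESM2B)

We begin with defining the so-called Jordan normal form.

Definition 2.3.15. The square matrix $J$ is called a Jordan matrix (or sometimes Jordan block), if there exist $\lambda \in \mathbb{C}$ and $k \in \mathbb{N}$ such that

\[
J = \begin{pmatrix}
\lambda & 1 & & & \\
 & \lambda & 1 & & \\
 & & \ddots & \ddots & \\
 & & & \lambda & 1 \\
 & & & & \lambda
\end{pmatrix} \in M_{k\times k}(\mathbb{C}),
\]

where all entries not shown are zero. Moreover, a matrix $A \in M_{n\times n}(\mathbb{C})$ is said to be in Jordan normal form if

\[
A = \begin{pmatrix}
J_1 & & & \\
 & J_2 & & \\
 & & \ddots & \\
 & & & J_l
\end{pmatrix},
\]

where $J_1, \dots, J_l$ are Jordan matrices.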

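Example 26. Examples for Jordan matrices are

\[ (\lambda) \in M_{1\times 1}(\mathbb{C}), \quad \begin{pmatrix} \lambda & 1 \\ 0 & \lambda \end{pmatrix} \in M_{2\times 2}(\mathbb{C}), \quad \begin{pmatrix} \lambda & 1 & 0 \\ 0 & \lambda & 1 \\ 0 & 0 & \lambda \end{pmatrix} \in M_{3\times 3}(\mathbb{C}). \]

Examples for Jordan normal forms are

\[ \begin{pmatrix} 3 & 1 \\ 0 & 3 \end{pmatrix}, \quad \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}, \quad \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}, \quad \begin{pmatrix} 3 & 1 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 3 \end{pmatrix}, \quad \begin{pmatrix} 3 & 0 & 0 \\ 0 & 3 & 1 \\ 0 & 0 & 3 \end{pmatrix}, \quad \begin{pmatrix} 3 & 1 & 0 \\ 0 & 3 & 1 \\ 0 & 0 & 3 \end{pmatrix}. \]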

As it will turn out soon, any endomorphism can be represented in this form, and up to permutation of the Jordan blocks, this representation is unique.$^6$ However, in order to determine the Jordan normal form for a given matrix, we first have to develop the theory.

Example 27. We will describe the properties of a Jordan matrix $A \in M_{k\times k}(\mathbb{C})$, where the entries on the diagonal are equal to $\lambda \in \mathbb{C}$. Since $A - t\mathbf{1}$ is an upper triangular matrix, it follows that

\[ \chi_A(t) = (\lambda - t)^k. \]

6 In other words: if the Jordan normal forms of two matrices coincide up to permutation of the Jordan blocks, then one may obtain one matrix from the other by a change of basis. In particular, this is the reason why the word ‘normal’ occurs in ‘Jordan normal form’.


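Hence, $\lambda$ is the only eigenvalue of $A$. Furthermore, since

\[
A - \lambda\mathbf{1} = \begin{pmatrix}
0 & 1 & & & \\
 & 0 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & 0 & 1 \\
 & & & & 0
\end{pmatrix},
\]

it follows that

\[ (A - \lambda\mathbf{1}) e_k = e_{k-1}, \quad (A - \lambda\mathbf{1})^2 e_k = e_{k-2}, \quad \dots, \quad (A - \lambda\mathbf{1})^{k-1} e_k = e_1, \quad (A - \lambda\mathbf{1})^k e_k = 0. \]

We hence have that the images of $e_k$ under $\mathbf{1}, (A - \lambda\mathbf{1}), \dots, (A - \lambda\mathbf{1})^{k-1}$ form the standard basis $e_1, \dots, e_k$, and $(A - \lambda\mathbf{1})^k e_k = 0$. In particular, $(A - \lambda\mathbf{1})^k v = 0$ for all $v \in \mathbb{C}^k$.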

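The key concept of the Jordan normal form is to find invariant subspaces, and bases for these subspaces such that the above property holds for each subspace. Therefore, for an eigenvalue $\lambda \in \mathbb{C}$ of $A \in M_{n\times n}(\mathbb{C})$, set

\[ E^1_\lambda := \ker(A - \lambda\mathbf{1}), \quad E^2_\lambda := \ker((A - \lambda\mathbf{1})^2), \quad E^3_\lambda := \ker((A - \lambda\mathbf{1})^3), \quad \dots \]

Note that $E^1_\lambda \subset E^2_\lambda \subset E^3_\lambda \subset \cdots$, and that each $E^l_\lambda$ ($l \in \mathbb{N}$) is a subspace, since it is the kernel of a linear operator. Hence, using the fact that $\mathbb{C}^n$ is finite dimensional, it follows that there exists $k \in \mathbb{N}$ such that $E^k_\lambda = E^l_\lambda$ for all $l \geq k$.

Definition 2.3.16. Set

\[ k_\lambda := \min\{k : E^k_\lambda = E^l_\lambda \text{ for all } l \geq k\}. \]

The subspace $E^*_\lambda := E^{k_\lambda}_\lambda$ is called the generalized eigenspace of the eigenvalue $\lambda$ of $A$.

Lemma 2.3.17. Generalized eigenspaces are invariant, that is, $A(E^*_\lambda) \subset E^*_\lambda$ for each eigenvalue $\lambda \in \mathbb{C}$ of the matrix $A$.

Proof. Note that $A(A - \lambda\mathbf{1}) = A^2 - \lambda A = (A - \lambda\mathbf{1})A$. Hence, for $v \in E^*_\lambda$, we have

\[ (A - \lambda\mathbf{1})^{k_\lambda} A v = A (A - \lambda\mathbf{1})^{k_\lambda} v = A 0 = 0. \]

Hence $Av \in \ker((A - \lambda\mathbf{1})^{k_\lambda}) = E^*_\lambda$.

In order to proceed, we will now make use of the following facts without proof. In order to formulate these facts, we need the following notion. Assume that $\lambda \in \mathbb{C}$ is an eigenvalue of the matrix $A \in M_{n\times n}(\mathbb{C})$. Then the order of $v \in E^*_\lambda$, $v \neq 0$, is defined by

\[ o_v := \min\{k : (A - \lambda\mathbf{1})^k v = 0\}. \]

Note that $o_v \leq k_\lambda$, and that

\[ (A - \lambda\mathbf{1})v \neq 0, \ \dots, \ (A - \lambda\mathbf{1})^{o_v - 1} v \neq 0, \quad (A - \lambda\mathbf{1})^{o_v} v = 0, \ (A - \lambda\mathbf{1})^{o_v + 1} v = 0, \ \dots \]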


Proposition 2.3.18. For each eigenvalue $\lambda \in \mathbb{C}$ of the matrix $A \in M_{n\times n}(\mathbb{C})$, the following holds.

(i) The algebraic multiplicity of $\lambda$ is equal to $\dim(E^*_\lambda)$, and $k_\lambda$ is smaller than or equal to the algebraic multiplicity of $\lambda$.

(ii) There exist $v_1, \dots, v_l \in E^*_\lambda$ such that, for $B := A - \lambda\mathbf{1}$,

\[ v_1, Bv_1, \dots, B^{o_{v_1}-1} v_1, \ v_2, Bv_2, \dots, B^{o_{v_2}-1} v_2, \ \dots, \ v_l, Bv_l, \dots, B^{o_{v_l}-1} v_l \]

form a basis of $E^*_\lambda$.

As an immediate consequence of Lemma 2.3.17 and Proposition 2.3.18 (i), we obtain the following decomposition of $\mathbb{C}^n$. Note that by Lemma 2.3.17, $A$ defines a linear operator $E^*_\lambda \to E^*_\lambda$, for each eigenvalue $\lambda$ of $A$. This operator is called the restriction of $A$ to the invariant subspace $E^*_\lambda$. In particular, by choosing a basis for $E^*_\lambda$, we may associate a matrix $A_\lambda$ of dimension $(\dim(E^*_\lambda)) \times (\dim(E^*_\lambda))$ with the restriction of $A$ to $E^*_\lambda$.

Corollary 2.3.19. Assume that $A$ has eigenvalues $\lambda_1, \dots, \lambda_l$. For each generalized eigenspace $E^*_{\lambda_i}$, choose a basis. Then the union of these bases is a basis of $\mathbb{C}^n$, and after change of basis, the matrix is of form

\[
\begin{pmatrix}
A_{\lambda_1} & & & \\
 & A_{\lambda_2} & & \\
 & & \ddots & \\
 & & & A_{\lambda_l}
\end{pmatrix}.
\]

Here, $A_{\lambda_i}$ refers to the matrix of dimension $(\dim(E^*_{\lambda_i})) \times (\dim(E^*_{\lambda_i}))$ given by the restriction of $A$ to $E^*_{\lambda_i}$ and the above choice of a basis for $E^*_{\lambda_i}$.

Proof. Assume that $\lambda, \mu$ with $\lambda \neq \mu$ are eigenvalues of $A$. Then $E^1_\lambda \cap E^1_\mu = \{0\}$. This can be seen by the following argument. For $v \in E^1_\lambda \cap E^1_\mu$, we have $Av - \lambda v = Av - \mu v = 0$. Hence $(\lambda - \mu)v = 0$, and since $\lambda \neq \mu$, it follows that $v = 0$. By definition of generalized eigenspaces, it follows that $E^*_\lambda \cap E^*_\mu = \{0\}$.

Hence, the union of the bases of the generalized eigenspaces is a set of linearly independent vectors. Using Proposition 2.3.18 (i), we hence have that the dimension of the span of these vectors is equal to $n$, and hence this union has to be a basis of $\mathbb{C}^n$. By Lemma 2.3.17, it then follows that with respect to this basis, the matrix is of the above form.

Combining Lemma 2.3.17 with Proposition 2.3.18 (ii) then gives the following.

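Corollary 2.3.20. Let $\lambda \in \mathbb{C}$ be an eigenvalue of the matrix $A \in M_{n\times n}(\mathbb{C})$, and $v_1, \dots, v_l \in E^*_\lambda$ as in Proposition 2.3.18 (ii). With respect to the basis

\[ B^{o_{v_1}-1} v_1, \dots, Bv_1, v_1, \ B^{o_{v_2}-1} v_2, \dots, Bv_2, v_2, \ \dots, \ B^{o_{v_l}-1} v_l, \dots, Bv_l, v_l \]

of $E^*_\lambda$ (where $B := A_\lambda - \lambda\mathbf{1}$), we then have that

\[
A_\lambda = \begin{pmatrix}
A_{\lambda,1} & & & \\
 & A_{\lambda,2} & & \\
 & & \ddots & \\
 & & & A_{\lambda,l}
\end{pmatrix},
\]

where, for each $i = 1, \dots, l$, the matrix $A_{\lambda,i} \in M_{o_{v_i} \times o_{v_i}}(\mathbb{C})$ is a Jordan matrix:

\[
A_{\lambda,i} = \begin{pmatrix}
\lambda & 1 & & & \\
 & \lambda & 1 & & \\
 & & \ddots & \ddots & \\
 & & & \lambda & 1 \\
 & & & & \lambda
\end{pmatrix}.
\]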

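Proof. Fix $i \in \{1, \dots, l\}$. Then, for $k = 0, \dots, o_{v_i} - 1$, we have

\[ B^{k+1} v_i = B B^k v_i = (A - \lambda\mathbf{1}) B^k v_i = A B^k v_i - \lambda B^k v_i. \]

Hence $A B^k v_i = B^{k+1} v_i + \lambda B^k v_i$, and in particular, with respect to the basis $B^{o_{v_i}-1} v_i, \dots, Bv_i, v_i$, we obtain (cf. Example 27)

\[
A_{\lambda,i} = \begin{pmatrix}
\lambda & 1 & & & \\
 & \lambda & 1 & & \\
 & & \ddots & \ddots & \\
 & & & \lambda & 1 \\
 & & & & \lambda
\end{pmatrix} \in M_{o_{v_i} \times o_{v_i}}(\mathbb{C}),
\]

since the column vectors are the coordinates of the images of $B^{o_{v_i}-1} v_i, \dots, Bv_i, v_i$ under $A$.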

Summarizing the above Corollaries, we obtain the main result of this section.

Theorem 2.3.21. For each endomorphism $L$ of a finite dimensional $\mathbb{C}$-vector space $V$, there exists a basis such that the associated matrix is in Jordan normal form.

The question now arises how one can deduce an algorithm for determining the Jordan normal form from these considerations. The delicate part here is to find the further decomposition of the generalized eigenspaces.

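Example 28. We now derive the Jordan normal form of

\[ A = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -4 & 5 & 0 \\ 0 & -4 & 2 & 3 \end{pmatrix}. \]

Step 1. Determine $\chi_A$, and the eigenvalues of $A$. Using the row operation $II - \frac{1}{5-t} III$, which does not change the determinant, we obtain

\[
\begin{aligned}
\chi_A(t) &= \det \begin{pmatrix} -1-t & 0 & 0 & 0 \\ 0 & 1-t & 1 & 0 \\ 0 & -4 & 5-t & 0 \\ 0 & -4 & 2 & 3-t \end{pmatrix}
= \det \begin{pmatrix} -1-t & 0 & 0 & 0 \\ 0 & 1-t+\frac{4}{5-t} & 0 & 0 \\ 0 & -4 & 5-t & 0 \\ 0 & -4 & 2 & 3-t \end{pmatrix} \\
&= (-1-t)\Big(1-t+\frac{4}{5-t}\Big)(5-t)(3-t) \\
&= (-1-t)(t^2-6t+9)(3-t) = (-1-t)(3-t)^3.
\end{aligned}
\]

Hence the eigenvalues of $A$ are $-1$ (of algebraic multiplicity 1) and 3 (of algebraic multiplicity 3).

Step 2. Geometric multiplicities.

Since the algebraic multiplicity of the eigenvalue $-1$ is equal to one, it follows that $\dim(E^*_{-1}) = 1$. Moreover, since $\dim(E_{-1}) \geq 1$, it follows that the geometric multiplicity is equal to one.

For the eigenvalue 3, we have to determine $E^1_3, E^2_3, \dots$. Therefore, set $B = A - 3 \cdot \mathbf{1}$. Transforming $B$ into REF gives

\[
\begin{pmatrix} -4 & 0 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 0 & -4 & 2 & 0 \\ 0 & -4 & 2 & 0 \end{pmatrix}
\xrightarrow{III-2II,\ IV-2II}
\begin{pmatrix} -4 & 0 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}
\longrightarrow
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -\frac{1}{2} & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]

Hence, $\dim(E^1_3) = 2$, and

\[ E^1_3 = \operatorname{span}\left( \begin{pmatrix} 0 \\ \frac{1}{2} \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix} \right). \]

Since $\dim(E^1_3) = 2$ is smaller than the algebraic multiplicity 3, we have to determine $E^2_3$:

\[
B^2 = \begin{pmatrix} -4 & 0 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 0 & -4 & 2 & 0 \\ 0 & -4 & 2 & 0 \end{pmatrix}
\begin{pmatrix} -4 & 0 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 0 & -4 & 2 & 0 \\ 0 & -4 & 2 & 0 \end{pmatrix}
= \begin{pmatrix} 16 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.
\]

Hence, $\dim(E^2_3) = 3$, and $E^2_3 = \operatorname{span}(e_2, e_3, e_4)$.

Step 3a. Note that by a purely combinatorial argument, one can already conclude what the JNF of $A$ looks like: there is one block of dimension 1 with respect to the eigenvalue $-1$. For the eigenvalue 3, there are $\dim(E^1_3) = 2$ blocks, of which $\dim(E^2_3) - \dim(E^1_3) = 1$ has dimension at least 2. Since the dimensions of these blocks add up to $\dim(E^*_3) = 3$, there is one block of dimension 2 and one block of dimension 1 with respect to the eigenvalue 3.

Step 3b. Cyclic vectors for $B$. Choose some element of $E^2_3 \setminus E^1_3$, e.g. $e_2$. Then $B(e_2) = (0,-2,-4,-4)^T$, and $B^2(e_2) = 0$. Moreover,

\[ B(e_2), \ e_2, \ e_4 \]

is a basis for $E^2_3 = E^*_3$.

Step 4. Summarizing, we obtain the following Jordan normal form.

\[ \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & 3 & 1 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 3 \end{pmatrix}. \]

Moreover, the matrix of the corresponding change of basis is given by

\[ P^{-1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -2 & 1 & 0 \\ 0 & -4 & 0 & 0 \\ 0 & -4 & 0 & 1 \end{pmatrix}. \]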
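The result of Example 28 can be checked by verifying that $A P^{-1} = P^{-1} J$, where $J$ denotes the Jordan normal form and the columns of $P^{-1}$ are the basis vectors $e_1, B(e_2), e_2, e_4$. The following Python sketch does this with integer arithmetic; the names `Q` (for $P^{-1}$), `J` and `matmul` are chosen here for illustration.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


A = [[-1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, -4, 5, 0],
     [0, -4, 2, 3]]

# Q plays the role of P^{-1}: its columns are e1, B(e2), e2, e4
Q = [[1, 0, 0, 0],
     [0, -2, 1, 0],
     [0, -4, 0, 0],
     [0, -4, 0, 1]]

# Jordan normal form obtained in Step 4
J = [[-1, 0, 0, 0],
     [0, 3, 1, 0],
     [0, 0, 3, 0],
     [0, 0, 0, 3]]

# B = A - 3 * 1, used to confirm that B^2 vanishes on span(e2, e3, e4)
B = [[A[i][j] - (3 if i == j else 0) for j in range(4)] for i in range(4)]
B_squared = matmul(B, B)

print(matmul(A, Q) == matmul(Q, J))
print(B_squared)
```

Since the columns of `Q` form a basis, the identity $AQ = QJ$ is equivalent to $Q^{-1} A Q = J$.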


Chapter 3

Probability theory

3.1 Basic notions of set theory

We first recall the following notions from set theory. Let $\Omega$ be a set, and $A, B \subset \Omega$ subsets of $\Omega$. Then

(i) $A \cup B := \{x \in \Omega : x \in A \text{ or } x \in B\}$ (union),

(ii) $A \cap B := \{x \in \Omega : x \in A \text{ and } x \in B\}$ (intersection),

(iii) $A \setminus B := \{x \in \Omega : x \in A \text{ and } x \notin B\}$ ((set-theoretic) difference),

(iv) $A^c := \{x \in \Omega : x \notin A\}$ (complement).

So let A,B,C ⊂Ω. The following identities apply to these operations.

(i) A∪B = B∪A, A∩B = B∩A.

(ii) A∪ (B∪C) = (A∪B)∪C, A∩ (B∩C) = (A∩B)∩C.

(iii) A∪Ω = Ω, A∩Ω = A.

(iv) $A \cup \emptyset = A$, $A \cap \emptyset = \emptyset$.

(v) A∩ (B∪C) = (A∩B)∪ (A∩C).

(vi) A∪ (B∩C) = (A∪B)∩ (A∪C).

(vii) A\B = A∩Bc.

(viii) A\ (B∪C) = (A\B)∩ (A\C), A\ (B∩C) = (A\B)∪ (A\C).

(ix) (Ac)c = A.

(x) (A∪B)c = Ac∩Bc, (A∩B)c = Ac∪Bc.


Moreover, recall that $A$ and $B$ are called disjoint if $A \cap B = \emptyset$. For a sequence of subsets, say $(A_n : n \in \mathbb{N})$ with $A_n \subset \Omega$ for all $n \in \mathbb{N}$, these notions are generalized as follows (using the properties in (ii) above).

(i) $\bigcup_{n=1}^{\infty} A_n := \{x \in \Omega : x \in A_n \text{ for at least one } n \in \mathbb{N}\}$.

(ii) $\bigcap_{n=1}^{\infty} A_n := \{x \in \Omega : x \in A_n \text{ for all } n \in \mathbb{N}\}$.

(iii) $(A_n : n \in \mathbb{N})$ is called pairwise disjoint (p.w. disjoint), if $A_i$ and $A_j$ are disjoint for all $i, j \in \mathbb{N}$, $i \neq j$.

3.2 Discrete probability spaces

Probability theory is a relatively new subject in mathematics. In order to describe random events (in most cases wins and losses in gambling), there were several attempts to give a mathematical model for certain games etc. However, it took some time before the suitable definition of a probability space was found. We begin with the study of discrete probability spaces.

Definition 3.2.1. A set $\Omega$ is called finite if it only contains finitely many elements. The number of elements in $\Omega$ is called the cardinality, and is denoted by $\#\Omega$. Moreover, a set is called countable if it is either finite, or it is possible to enumerate the elements of $\Omega$.$^1$ That is, $\Omega$ can be written as

\[ \Omega = \{a_n : n \in \mathbb{N}\}. \]

A set which is not countable is called uncountable.

Examples for countable sets are $\{\text{head}, \text{tail}\}$, $\{1,2,3,4,5,6\}$, $\mathbb{N}$, $\mathbb{Z}$ and $\mathbb{Q}$, and examples of uncountable sets are $\mathbb{R}$ or $\mathbb{C}$. Moreover, $\#\{\text{head}, \text{tail}\} = 2$, and $\#\{1,2,3,4,5,6\} = 6$. The countability of these sets can be shown using the following arguments.

(i) The first two sets are countable, since they are finite.

(ii) For $\mathbb{N}$, set $a_n = n$.

(iii) In order to obtain an enumeration for $\mathbb{Z}$, set $a_1 = 0$, $a_2 = 1$, $a_3 = -1$, $a_4 = 2$, $a_5 = -2$, $\dots$.

(iv) In order to find an enumeration for $\mathbb{Q}$, the so-called diagonal argument is applicable. Therefore, the elements of $\mathbb{Q}$ are arranged in a table, where each cell corresponds to the element $p/q$ with $p$ referring to the label of the column, and $q$ to the number of the row, respectively.

\[
\begin{array}{c|cccccc}
 & 0 & 1 & -1 & 2 & -2 & \cdots \\ \hline
1 & a_1 & a_2 & a_4 & a_7 & a_{11} & \cdots \\
2 & a_3 & a_5 & a_8 & a_{12} & \ddots & \\
3 & a_6 & a_9 & a_{13} & \ddots & & \\
4 & a_{10} & a_{14} & \ddots & & & \\
5 & a_{15} & \ddots & & & & \\
\vdots & \vdots & & & & &
\end{array}
\]

Moreover, note that we do not have to require that $a_n \neq a_m$ for $n \neq m$. In the above example, we have e.g. $a_2 = 1/1 = a_{12} = 2/2$, or $0 = a_1 = a_3 = a_6 = a_{10} = a_{15} = \cdots$.

1 The precise mathematical definition of enumeration is the following: $\Omega$ is referred to as a countable set if there exists a map $p : \mathbb{N} \to \Omega$ such that $p(\mathbb{N}) = \Omega$.

A discrete probability measure is defined as an additive set function on the set of subsets of a countable set.

Definition 3.2.2 (Discrete probability space). Let $\Omega$ be a countable set. Then a function $P$ from the set of subsets of $\Omega$ to $[0,1]$ is called probability measure if

(i) $P(\Omega) = 1$, $P(\emptyset) = 0$.

(ii) For a sequence $(A_n : n \in \mathbb{N})$ of pairwise disjoint subsets of $\Omega$,

\[ P\Big( \bigcup_{n=1}^{\infty} A_n \Big) = \sum_{n=1}^{\infty} P(A_n). \]

Furthermore, one refers to the pair $(\Omega, P)$ as a discrete probability space. Here, due to the aim to model random events, the set $\Omega$ is referred to as the sample space, $A \subset \Omega$ is called an event, and $P(A)$ the probability of the event $A$. There are several immediate consequences for $(\Omega, P)$.

Proposition 3.2.3. Assume that Ω is countable, and that P is a probability measure on Ω.

(i) P(Ac) = 1−P(A) for all A⊂Ω.

(ii) If A⊂ B, then P(A)≤ P(B) (for all A,B⊂Ω).

(iii) P(A∪B) = P(A)+P(B)−P(A∩B)

Proof. The idea of the proof relies in each case on a disjoint decomposition of the corresponding sets. Here, we will only give a proof of (ii):

\[ P(B) = P(\underbrace{B \cap A}_{=A}) + P(B \setminus A) = P(A) + \underbrace{P(B \setminus A)}_{\geq 0} \geq P(A). \]


We now give examples of probability measures to illustrate how to model random events.

Example 29 (Throwing a fair die). Here, we have $\Omega = \{1, 2, \dots, 6\}$, and

\[ P(\{1\}) = P(\{2\}) = \cdots = P(\{6\}) := \frac{1}{6}. \]

Using property (ii) in the definition of a probability measure, we can now extend $P$ to all subsets, e.g.

\[ P(\{1,5,6\}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}. \]

Example 30 (Throwing two fair dice). Here, we have $\Omega = \{1, 2, \dots, 6\} \times \{1, 2, \dots, 6\}$, and, for $(i,j) \in \Omega$,

\[ P(\{(i,j)\}) := \frac{1}{36}. \]

As above, using property (ii) in the definition of a probability measure, $P$ can now be extended to all subsets, e.g.

\[ P(\{(x,1) : x \in \{1, 2, \dots, 5\}\}) = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{5}{36}. \]
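The two dice examples can be reproduced with exact fractions. The following Python sketch assumes that all elementary events are equally likely; the function name `uniform_probability` is introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product


def uniform_probability(event, sample_space):
    """P(A) when every outcome of a finite sample space is equally likely."""
    outcomes = set(sample_space)
    favourable = set(event) & outcomes  # ignore elements outside the space
    return Fraction(len(favourable), len(outcomes))


die = range(1, 7)
p_one_die = uniform_probability({1, 5, 6}, die)

two_dice = list(product(die, repeat=2))
p_two_dice = uniform_probability({(x, 1) for x in range(1, 6)}, two_dice)

print(p_one_die, p_two_dice)
```

Using `Fraction` avoids rounding, so the results can be compared directly with the values obtained by hand.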

Example 31 (Radioactive decay, discrete model). Assume we have a single radioactive atom, and we want to model the waiting time (in years) until the atom decays, given the half-life $K$ (in years). This is modelled as follows.

(i) $\Omega = \mathbb{N}$.

(ii) We now assume that $P(\{n\}) = Cq^n$, for some $q \in (0,1)$ and $C \in \mathbb{R}$. Since $P(\Omega)$ has to be equal to one, it follows that$^2$

\[ 1 = \sum_{n=1}^{\infty} P(\{n\}) = \sum_{n=1}^{\infty} Cq^n = Cq \sum_{n=0}^{\infty} q^n = \frac{Cq}{1-q}. \]

Hence, for given $q$, $C = (1-q)/q$. Furthermore, by the definition of the half-life $K$,

\[ \frac{1}{2} = \sum_{n=1}^{K} P(\{n\}) = \sum_{n=1}^{K} Cq^n = (1-q) \sum_{n=0}^{K-1} q^n = 1 - q^K. \]

Hence,

\[ q = \Big( \frac{1}{2} \Big)^{1/K}. \]
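The two conditions that determine $C$ and $q$ can be checked numerically. The following Python sketch uses floating point arithmetic; the function names `decay_model` and `prob`, and the half-life of 100 years, are assumptions made for this illustration.

```python
import math


def decay_model(half_life):
    """Return (q, C) with P({n}) = C * q**n for the given half-life in years."""
    q = 0.5 ** (1.0 / half_life)
    C = (1.0 - q) / q
    return q, C


def prob(n, q, C):
    """Probability that the atom decays in year n."""
    return C * q ** n


K = 100  # assumed half-life, for illustration
q, C = decay_model(K)

# probability of a decay within the first K years (should be about 1/2)
within_half_life = sum(prob(n, q, C) for n in range(1, K + 1))

# partial sum over a long horizon (should be close to P(Omega) = 1)
long_horizon = sum(prob(n, q, C) for n in range(1, 100 * K + 1))

print(within_half_life, long_horizon)
```

The long-horizon sum only approximates the infinite series; the remaining mass $q^{100K} = 2^{-100}$ is far below floating point resolution here.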

2 Here, we use the summation formulae of the geometric series:

\[ \sum_{n=0}^{\infty} q^n = \frac{1}{1-q}, \qquad \sum_{n=0}^{k} q^n = \frac{1-q^{k+1}}{1-q}. \]


3.2.1 Combinatorics, and uniform sample spaces

A uniform sample space is, at first sight, a simple case of a discrete probability space. Namely, $(\Omega, P)$ is called a uniform sample space if $\Omega$ is a finite set, and, for $A \subset \Omega$, we have

\[ P(A) = \frac{\#A}{\#\Omega}. \]

In fact, the main problem here is to obtain $\#\Omega$ for a given experiment. E.g., if one considers a "6 out of 49" lottery, then it is reasonable to assume that each possible outcome has the same probability. Hence, in order to determine this probability, one has to determine the number of all possible outcomes (this will be solved using combinations without repetitions: there are $\binom{49}{6} = 13983816$ possible outcomes).

Permutations

A permutation is defined to be an ordered arrangement of several elements. Here, we will assume that we have $n$ distinguishable elements (e.g. numbered with $1, \dots, n$), and that we choose $k$ of them.

Case 1. Permutations with repetitions. So assume we choose $k$ times one element out of $\{1, \dots, n\}$. Hence we obtain an element $(n_1, \dots, n_k) \in \{1, \dots, n\}^k$. The total number of possible outcomes is

\[ \underbrace{n \cdot n \cdots n}_{k \text{ times}} = n^k, \]

since at each step, there are $n$ possibilities.

Case 2. Permutations without repetitions. So assume we choose one element $n_1$ out of $\{1, \dots, n\}$. After that, choose an element $n_2$ out of $\{1, \dots, n\} \setminus \{n_1\}$, then $n_3 \in \{1, \dots, n\} \setminus \{n_1, n_2\}$, then $n_4 \in \{1, \dots, n\} \setminus \{n_1, n_2, n_3\}$, and so on. By the same argument as above, we obtain, for $k \leq n$, that the total number of possible outcomes is

\[ \underbrace{n \cdot (n-1) \cdots (n-k+1)}_{k \text{ factors}} = \frac{n!}{(n-k)!}, \]

where $n!$ denotes the factorial of $n$, that is

\[ n! := 1 \cdot 2 \cdot 3 \cdot 4 \cdots (n-1) \cdot n. \]

In order to see the difference between these two notions, consider the following two questions.

(i) How many different words of length 4 exist, containing the letters JACOBS? By Case 1, we obtain that there exist

\[ 6^4 = 1296 \]

different words.

(ii) How many different words of length 4 exist, containing the letters JACOBS, and with each letter occurring at most once? By Case 2, with $n = 6$ and $k = 4$, we obtain that there exist

\[ \frac{6!}{(6-4)!} = \frac{720}{2} = 360 \]

different words.
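Both counts can be confirmed by brute-force enumeration. The following Python sketch compares the formulas of Case 1 and Case 2 with the number of words generated by `itertools`; the variable names are chosen for illustration, and `math.perm` requires Python 3.8 or later.

```python
import math
from itertools import permutations, product

letters = "JACOBS"
k = 4

# Case 1: permutations with repetitions, n^k
with_repetition = len(letters) ** k
enumerated_with = sum(1 for _ in product(letters, repeat=k))

# Case 2: permutations without repetitions, n! / (n - k)!
without_repetition = math.perm(len(letters), k)
enumerated_without = sum(1 for _ in permutations(letters, k))

print(with_repetition, enumerated_with)
print(without_repetition, enumerated_without)
```

Enumeration is only feasible for small $n$ and $k$, but it is a useful cross-check of the closed formulas.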

Combinations

A combination is defined as an unordered arrangement of several elements. Here, we will assume that we have $n$ distinguishable elements (e.g. numbered with $1, \dots, n$), and that we choose $k$ of them.

Case 2. Combinations without repetitions. So assume we choose one element $n_1$ out of $\{1, \dots, n\}$. After that, choose an element $n_2$ out of $\{1, \dots, n\} \setminus \{n_1\}$, then $n_3 \in \{1, \dots, n\} \setminus \{n_1, n_2\}$, then $n_4 \in \{1, \dots, n\} \setminus \{n_1, n_2, n_3\}$, and so on, where the order of the chosen elements is disregarded. The total number of possible outcomes is

\[ \binom{n}{k}, \]

where $\binom{n}{k}$ ("$n$ choose $k$") is defined by

\[ \binom{n}{k} := \frac{n!}{k!(n-k)!}. \]

Proof. By the number of possible outcomes for permutations without repetitions, we know that there are $n!/(n-k)!$ possible outcomes if the ordering is not dropped. Hence, one only has to divide by the number of possible arrangements of $k$ elements, that is $k!$, to obtain the total number of possible outcomes.$^3$

Remarks.

(i) $\binom{n}{k}$ is also called binomial coefficient, since by the binomial theorem, we have

\[ (a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}. \]

(ii) There is also a further interpretation of this result. Assume one has $n$ places to put $k$ (indistinguishable) objects, where in each place, there may be at most one object. Then, the number of possible arrangements is $\binom{n}{k}$. This might be seen as follows. In the original model, we end up with a finite sequence of elements in $\{1, \dots, n\}$, where each element of $\{1, \dots, n\}$ appears at most once. By identifying these elements with the corresponding places, the assertion follows.

3 This is called the shepherd’s method, since according to a French tale, a shepherd counted his sheep with the following method: he counted the number of legs, and then divided this number by 4.


Case 1. Combinations with repetitions. So assume we choose $k$ times one element out of $\{1, \dots, n\}$. Hence we obtain elements $n_1, \dots, n_k \in \{1, \dots, n\}$. In this case, the total number of possible outcomes is

\[ \binom{n+k-1}{k}. \]

Proof. The proof will make use of the trick that the set of possible outcomes is mapped by an invertible map to another set. Then the cardinalities of these sets have to be equal.

So assume that $n_1, \dots, n_k$ is the outcome after choosing $k$ elements with repetition. By sorting the elements by their size, we obtain

\[ n_{(1)}, n_{(2)}, \dots, n_{(k)} \quad \text{with } n_{(i)} \leq n_{(i+1)}, \text{ for } i = 1, \dots, k-1. \]

In particular, the problem is transferred to determining the number of possibilities for tuples of type $(n_{(1)}, n_{(2)}, \dots, n_{(k)})$. Here the above trick now applies. For each $i = 1, \dots, k$, let

\[ \pi(n_{(i)}) := n_{(i)} + i - 1. \]

Hence we obtain a map

\[
\begin{aligned}
\{(\omega_1, \omega_2, \dots, \omega_k) \in \{1, \dots, n\}^k : \omega_1 \leq \omega_2 \leq \cdots \leq \omega_k\} &\to \{(\omega_1, \omega_2, \dots, \omega_k) \in \{1, \dots, n+k-1\}^k : \omega_1 < \omega_2 < \cdots < \omega_k\}, \\
(n_{(1)}, n_{(2)}, \dots, n_{(k)}) &\mapsto (\pi(n_{(1)}), \pi(n_{(2)}), \dots, \pi(n_{(k)})).
\end{aligned}
\]

Furthermore, this map has an inverse, given by subtracting $i - 1$ from the $i$-th entry. Hence, the solution for the original problem is given by the cardinality of the set

\[ \{(\omega_1, \omega_2, \dots, \omega_k) \in \{1, \dots, n+k-1\}^k : \omega_1 < \omega_2 < \cdots < \omega_k\}. \]

Using the result for combinations without repetitions, we obtain that

\[ \#\{(\omega_1, \omega_2, \dots, \omega_k) \in \{1, \dots, n+k-1\}^k : \omega_1 < \omega_2 < \cdots < \omega_k\} = \binom{n+k-1}{k}. \]
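The bijection used in this proof can be tested on a small case. The following Python sketch enumerates all sorted outcomes with repetition, applies the shift $n_{(i)} \mapsto n_{(i)} + i - 1$, and compares the image with the strictly increasing tuples; the function name `shift` and the values $n = 4$, $k = 3$ are chosen here for illustration, and `math.comb` requires Python 3.8 or later.

```python
import math
from itertools import combinations, combinations_with_replacement


def shift(sorted_outcome):
    """The map from the proof: the i-th entry (1-based) is increased by i - 1."""
    return tuple(x + i for i, x in enumerate(sorted_outcome))


n, k = 4, 3  # small illustrative values

# sorted outcomes with repetition from {1, ..., n}
with_repetition = list(combinations_with_replacement(range(1, n + 1), k))

# strictly increasing tuples from {1, ..., n + k - 1}
strictly_increasing = set(combinations(range(1, n + k), k))

images = {shift(outcome) for outcome in with_repetition}

print(len(with_repetition), math.comb(n + k - 1, k))
print(images == strictly_increasing)

lottery = math.comb(49, 6)  # the "6 out of 49" lottery
print(lottery)
```

Because the image has as many elements as the domain and equals the target set, the shift is indeed a bijection in this case.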

Multinomial coefficients

Finally, we will derive the number of possible arrangements in the following situation. Assume that $n = n_1 + n_2 + \cdots + n_m$, for $n_i \in \mathbb{N}$, and assume that there are

$n_1$ indistinguishable objects of type 1,
$n_2$ indistinguishable objects of type 2,
$\vdots$
$n_m$ indistinguishable objects of type $m$.

As an example, consider an urn with, say, $n_1 = 2$ red balls, $n_2 = 3$ blue balls, and $n_3 = 5$ green balls. We are now interested in the number of possible (ordered) arrangements of these $n$ objects. The answer is

\[ \frac{n!}{n_1! \cdot n_2! \cdots n_m!}. \]

Proof. The argument is the same as for combinations without repetitions. If the objects were distinguishable, there would be $n!$ different possibilities. One now has to determine the number of arrangements which are identified under the original model.

Note that two given arrangements (in the model with distinguishable objects) represent the same outcome in the given model if and only if one is obtained from the other by rearranging the objects within each type. Since there are $n_1! \cdot n_2! \cdots n_m!$ possibilities to do that, we arrive at a total number of

\[ \frac{n!}{n_1! \cdot n_2! \cdots n_m!}. \]

Note that $n!/(n_1! \cdot n_2! \cdots n_m!)$ is also called multinomial coefficient.$^4$ The above result now enables us to answer the following problems.

(i) How many different words of length 11 can be formed from the word “MISSISSIPPI”? Using the multinomial coefficient, it follows that the number of different words is

\[ \frac{11!}{1! \, 4! \, 4! \, 2!} = 34650. \]

(ii) Moreover, the multinomial coefficient tells how many different partitions of a set of size $n$ into sets of size $n_1, \dots, n_m$ exist.

Stirling’s formula

Sometimes it is useful to have an approximate formula for the factorial at hand, since the calculation of factorials of large numbers can take a while. The approximation below is due to Stirling.

Proposition 3.2.4 (Stirling’s formula). We have

\[ \lim_{n\to\infty} \frac{\sqrt{2\pi n}\, n^n e^{-n}}{n!} = 1. \]
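To get a feeling for the speed of convergence, one can tabulate the ratio in Stirling's formula. The snippet below is only an illustrative sketch (the function name `stirling` is an assumption of this example); it uses plain floating point arithmetic and is therefore restricted to moderate values of $n$.

```python
from math import exp, factorial, pi, sqrt


def stirling(n):
    """Stirling's approximation sqrt(2*pi*n) * n^n * e^(-n) of n!."""
    return sqrt(2 * pi * n) * n**n * exp(-n)


# The ratio approaches 1 from below as n grows.
for n in (1, 5, 10, 50, 100):
    print(n, stirling(n) / factorial(n))
```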

⁴ This is due to the multinomial theorem, which states that

\[ (x_1 + x_2 + \dots + x_m)^n = \sum_{n_1,\dots,n_m:\ n_1+\dots+n_m=n} \frac{n!}{n_1! \cdot n_2! \cdots n_m!}\, x_1^{n_1} \cdots x_m^{n_m}. \]

3.2.2 Conditional probabilities

Let $(\Omega,P)$ be a probability space. In fact, the results of this section also apply to continuous probability spaces, which are defined below. The only difference is that ‘$\subset \Omega$’ has to be replaced by ‘$\in \mathcal{B}$’.

For a given event $B \subset \Omega$ it is often of interest to determine the probability of an event given that the event $B$ has occurred. For example, what is the probability that a radioactive atom decays in the next 5 years, when it is known that it has not decayed in the last 100 years (see Example 31 for the underlying probability space)? The answer is given by so-called conditional probabilities.

Definition 3.2.5. Assume that $A,B \subset \Omega$ with $P(B) > 0$. Then

\[ P(A|B) := \frac{P(A\cap B)}{P(B)} \]

is called the conditional probability of $A$ given the event $B$. Furthermore, the two events $A,B$ are called independent if

\[ P(A\cap B) = P(A)P(B). \]

Note that, if $A,B \subset \Omega$ with $P(A),P(B) > 0$, then independence of $A,B$ is equivalent to

\[ P(A|B) = P(A), \quad P(B|A) = P(B). \]

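We now apply conditional probabilities to the above example of radioactive decay. For $A := \{101,102,103,104,105\}$ and $B := \{1,\dots,100\}^c$, we have $A \cap B = A$ and hence

\[
P(A|B) = \frac{P(\{101,\dots,105\})}{1-P(\{1,\dots,100\})}
= \frac{\frac{1-q}{q}\sum_{n=101}^{105} q^n}{1-\frac{1-q}{q}\sum_{n=1}^{100} q^n}
= \frac{q^{100}(1-q)\sum_{n=0}^{4} q^n}{1-(1-q)\sum_{n=0}^{99} q^n}
= \frac{q^{100}(1-q^5)}{1-(1-q^{100})}
= 1-q^5.
\]

In order to make the difference more visible, assume that the half-life is one hundred years. Then $q = \sqrt[100]{1/2}$, and

\[
P(A) = \frac{1-q}{q}\sum_{n=101}^{105} q^n = q^{100}(1-q^5) = \tfrac{1}{2}\bigl(1-(1/2)^{1/20}\bigr) \approx 0.01703,
\qquad
P(A|B) = 2 \cdot P(A) \approx 0.03406.
\]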

Furthermore, $A$ and $B$ are not independent, since $P(A\cap B) = P(A) \neq P(A)P(B) = P(A)/2$.

Moreover, conditional probabilities can be used to construct new probability spaces. That is, for fixed $B\subset\Omega$ with $P(B) > 0$, $(\Omega,P(\cdot|B))$ is again a probability space.⁵ An important tool for treating conditional probabilities are Bayes’ rules. In order to formulate them, one first has to introduce the notion of a partition. A partition is a countable set $\mathcal{A}$, finite or infinite, of pairwise disjoint subsets of $\Omega$ such that their union is the whole space. That is,

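(i) $A\cap B = \emptyset$ for all $A,B \in \mathcal{A}$ with $A \neq B$,

(ii) $\bigcup_{A\in\mathcal{A}} A = \Omega$.

⁵ The notation $P(\cdot|B)$ stands for the map

\[ P(\cdot|B) : \{A : A\subset\Omega\} \to [0,1], \quad A \mapsto P(A|B). \]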

The following theorem states different versions of Bayes’ rule. They all rely on the identity

P(A|B)P(B) = P(A∩B) = P(B|A)P(A), (3.1)

and relate P(A|B) with P(B|A).

Theorem 3.2.6 (Bayes’ rules).

Bayes’ rule 1. Let $A,B\subset\Omega$ with $P(A),P(B) > 0$. Then

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)}. \]

Bayes’ rule 2 (Law of total probability). Let $\mathcal{A}$ be a partition with elements of (strictly) positive probability (that is, $P(B) > 0$ for all $B \in \mathcal{A}$). Then, for all $A\subset\Omega$,

\[ P(A) = \sum_{B\in\mathcal{A}} P(B)P(A|B). \]

Bayes’ rule 3. Let $\mathcal{A}$ be a partition with elements of (strictly) positive probability (that is, $P(B) > 0$ for all $B \in \mathcal{A}$). Then, for all $A,C \subset\Omega$ with $P(A),P(C) > 0$,

\[ P(C|A) = \frac{P(A|C)P(C)}{\sum_{B\in\mathcal{A}} P(A|B)P(B)}. \]

Bayes’ rule 4. Let $A,B\subset\Omega$ with $P(A) > 0$ and $0 < P(B) < 1$. Then

\[ P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B)+P(A|B^c)P(B^c)}. \]

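Proof. Bayes’ rule 1 is a direct consequence of Equation (3.1). The law of total probability (or Bayes’ rule 2) is a consequence of the fact that $\mathcal{A}$ is a partition: since the sets $A\cap B$, $B\in\mathcal{A}$, are pairwise disjoint,

\[
P(A) = P(A\cap\Omega) = P\Bigl(A\cap\bigcup_{B\in\mathcal{A}} B\Bigr) = P\Bigl(\bigcup_{B\in\mathcal{A}} A\cap B\Bigr) = \sum_{B\in\mathcal{A}} P(A\cap B) = \sum_{B\in\mathcal{A}} P(B)P(A|B).
\]

Combining Bayes’ rules 1 and 2, we obtain rule 3 as follows:

\[
P(C|A) \overset{\text{Rule 1}}{=} \frac{P(A|C)P(C)}{P(A)} \overset{\text{Rule 2}}{=} \frac{P(A|C)P(C)}{\sum_{B\in\mathcal{A}} P(B)P(A|B)}.
\]

Finally, Bayes’ rule 4 is the special case of rule 3 in which the partition is given by $\{B,B^c\}$.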

The classical application of Bayes’ rules is given by the following example. Assume that the rate of infection of a certain disease in a population is known. Moreover, there is a test available whose performance is known (i.e. the rates at which it correctly or incorrectly indicates an infection). This is modelled by the following events and probabilities for a single person.

B := the person is infected,
A := the result of the test is positive.

By assumption, the following probabilities are known:

$P(B)$ := rate of infection,

$P(A|B)$ := rate at which the test reveals the infection of an infected person,

$P(A|B^c)$ := rate of a ‘false positive’.

In this situation, one would like to know the probability of being infected given that the outcome of the test is positive. So we apply Bayes’ rule 4 and obtain

\[
P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B)+P(A|B^c)P(B^c)} = \frac{P(A|B)P(B)}{P(A|B)P(B)+P(A|B^c)(1-P(B))},
\]

where all the quantities on the right-hand side are known.
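As a small numerical illustration of Bayes’ rule 4, the following Python sketch evaluates the formula for one set of input values. The numbers (1% infection rate, 99% detection rate, 5% false positives) are purely hypothetical and are not taken from the notes; the function name `posterior` is likewise chosen for this example only.

```python
def posterior(prior, detection_rate, false_positive_rate):
    """P(B|A) by Bayes' rule 4.

    prior               = P(B), rate of infection
    detection_rate      = P(A|B), positive test for an infected person
    false_positive_rate = P(A|B^c), positive test for a healthy person
    """
    numerator = detection_rate * prior
    denominator = numerator + false_positive_rate * (1.0 - prior)
    return numerator / denominator


# Hypothetical values chosen only for illustration.
print(posterior(prior=0.01, detection_rate=0.99, false_positive_rate=0.05))
```

With these assumed values, only about one in six persons with a positive test result is actually infected, although the test itself looks reliable. This effect is the reason why the example is considered classical.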

3.2.3 Important discrete probability spaces

In this section, important probability spaces are introduced.

The binomial distribution B(n, p)

The binomial distribution models the number of successes when tossing an unfair coin $n$ times. Therefore, the first step is to introduce the Bernoulli distribution (‘tossing an unfair coin once’). For $p\in(0,1)$, the Bernoulli distribution is defined by the probability space $(\Omega,P)$ with

\[ \Omega := \{0,1\}, \quad P(\{0\}) = 1-p, \quad P(\{1\}) = p. \]

The binomial distribution is now defined, for $p \in (0,1)$ and $n \in \mathbb{N}$, by

\[ \Omega := \{0,1,\dots,n\}, \quad P(\{k\}) := \binom{n}{k} p^k (1-p)^{n-k}. \]

For a graphical illustration of these probabilities, see Figure 3.1. The binomial distribution describes the number of 1’s in $n$ independent Bernoulli trials. This can be deduced as follows. Let

\[ \Omega^* := \{0,1\}^n, \quad P^*(\{(\omega_1,\dots,\omega_n)\}) := p^k(1-p)^{n-k}, \]

where $k$ is the number of 1’s (and hence $n-k$ the number of 0’s) among $\omega_1,\dots,\omega_n$. By adding up the probabilities of all tuples with exactly $k$ 1’s (there are $\binom{n}{k}$ of them, which gives $\binom{n}{k} p^k(1-p)^{n-k}$), the assertion follows.
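The counting argument can be verified by brute force for small $n$. The sketch below (function names are chosen for this illustration) compares the closed formula with the sum of $P^*$ over all tuples in $\{0,1\}^n$ with exactly $k$ ones.

```python
from itertools import product
from math import comb


def binomial_pmf(n, p, k):
    """P({k}) for the binomial distribution B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pmf_by_enumeration(n, p, k):
    """Sum of P*(omega) over all omega in {0,1}^n with exactly k ones."""
    total = 0.0
    for omega in product((0, 1), repeat=n):
        if sum(omega) == k:
            total += p**k * (1 - p) ** (n - k)
    return total


n, p = 10, 0.2  # the parameters used in Figure 3.1
for k in range(n + 1):
    print(k, binomial_pmf(n, p, k), pmf_by_enumeration(n, p, k))
```

The enumeration visits all $2^n$ tuples and is therefore only feasible for small $n$; it is meant as a check of the argument, not as a way to compute the probabilities.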

Figure 3.1: Binomial distribution with n = 10, p = 0.2 (horizontal axis: k, vertical axis: probability(k)).

Multinomial distribution

The multinomial distribution is a generalization of the binomial distribution, and models the number of times each side of an unfair, $k$-sided die shows up in $n$ independent trials. A single throw of the die is modelled by

\[ \Omega := \{1,2,\dots,k\}, \quad P(\{j\}) = p_j, \]

where $k \in \mathbb{N}$ and $p_1,\dots,p_k \in [0,1]$ with $p_1 + \dots + p_k = 1$. For the outcome of $n$ independent experiments, we hence obtain the probability space

\[ \Omega := \{(n_1,\dots,n_k) : n_1 + \dots + n_k = n\}, \quad P(\{(n_1,\dots,n_k)\}) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}. \]

Hypergeometric distribution

The hypergeometric distribution describes the outcome of the following experiment. Assume that an urn contains $W$ white balls and $B$ black balls. After having randomly chosen $n$ balls out of the urn without repetition, one is interested in the probability of the event of having drawn $w$ white balls and $b$ black balls, with $b+w = n$. The corresponding probability space is given by

\[ \Omega := \{(b,w) : b+w = n\}, \quad P(\{(b,w)\}) := \frac{\binom{W}{w}\binom{B}{b}}{\binom{W+B}{w+b}}. \]

A classical application of the hypergeometric distribution is to determine the probability of having precisely $w$ correct numbers in a ‘6 out of 49’ lottery. Namely, this probability is given by

\[ \frac{\binom{6}{w}\binom{43}{6-w}}{\binom{49}{6}}. \]

Figure 3.2: The hypergeometric distribution with W = 6, B = 43, b+w = 6 (horizontal axis: w, vertical axis: P(w)).

Here, the 6 numbers which were drawn in the lottery correspond to the $W = 6$ white balls, and the remaining 43 numbers to the black balls. The probabilities for $w = 0,1,\dots,6$ are depicted in Figure 3.2.
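The lottery probabilities can be evaluated directly from the formula. The following Python sketch (the function name `hypergeometric_pmf` is an assumption of this example) prints the probability of $w$ correct numbers for $w = 0,\dots,6$.

```python
from math import comb


def hypergeometric_pmf(W, B, n, w):
    """Probability of drawing w white balls when n balls are drawn
    without repetition from an urn with W white and B black balls."""
    return comb(W, w) * comb(B, n - w) / comb(W + B, n)


# '6 out of 49' lottery: W = 6 winning numbers, B = 43 other numbers.
for w in range(7):
    print(w, hypergeometric_pmf(6, 43, 6, w))
```

In particular, the probability of six correct numbers is $1/\binom{49}{6} = 1/13983816$.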

Figure 3.3: The Poisson distribution with parameter λ = 2 (horizontal axis: k, vertical axis: probability(k)).

The Poisson distribution Pois(λ )

The Poisson distribution is used to model the number of occurrences of rare events. In particular, a well-known historical application of this distribution was to model the number of soldiers killed by horse kicks in the Prussian cavalry (this can be found in a textbook on statistics from 1898). For a given parameter $\lambda > 0$, the probability space is defined by

\[ \Omega := \{0,1,2,3,\dots\}, \quad P(\{k\}) := e^{-\lambda}\frac{\lambda^k}{k!}. \]

Note that $P(\Omega) = 1$ due to the series representation of the exponential function

\[ e^{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}. \]

3.3 Continuous probability spaces

If one wants to define a probability measure on $\Omega = \mathbb{R}$, there is the following problem. There does not exist a function $P$ from all subsets of $\mathbb{R}$ to $[0,1]$ such that for a sequence $(A_n : n \in \mathbb{N})$ of pairwise disjoint subsets of $\Omega$, we have

\[ P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = \sum_{n=1}^{\infty} P(A_n). \]

In order to solve this problem, one has to restrict the definition of $P$ to a suitable class of subsets of $\mathbb{R}$, which are called ‘measurable’. The problem with defining $P$ on all subsets of $\mathbb{R}$ might be illustrated by the Banach-Tarski paradox: it is possible to decompose a ball into finitely many pieces such that these pieces can be rearranged so that they fit precisely into two balls of the same size as the original one. Hence, by considering all subsets of $\mathbb{R}^3$, it is possible to construct objects of double volume just by cutting and gluing.

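3.3.1 Measurability

The class of measurable subsets of $\mathbb{R}$ is defined as follows.

Definition 3.3.1. The Borel σ-algebra is defined inductively as follows. Starting with the set

\[ \mathcal{A}_0 := \{[a,b] : a,b \in \mathbb{R}\}, \]

one defines $\mathcal{A}_1$ to be the set of all sets obtained by applying the set operations ($\bigcup_{i=1}^{\infty} A_i$, $\bigcap_{i=1}^{\infty} A_i$, $A^c$) to sets from $\mathcal{A}_0$. Then $\mathcal{A}_2$ is defined to be the set of sets obtained by applying the set operations to sets from $\mathcal{A}_1$, etc. Note that $\mathcal{A}_0 \subset \mathcal{A}_1 \subset \cdots$. The Borel σ-algebra $\mathcal{B}(\mathbb{R}) = \mathcal{B}$ is defined by

\[ \mathcal{B} := \bigcup_{n=0}^{\infty} \mathcal{A}_n, \]

and a set $A \in \mathcal{B}$ is called measurable.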

It is important to note that this is not a major restriction. In fact, $\mathcal{B}$ contains sufficiently many sets for virtually all applications. For example, by choosing sequences $a_n \searrow a$ and $b_n \nearrow b$ (with $a_n > a$ and $b_n < b$) for $a,b \in \mathbb{R}$, $a < b$, it follows that

\[ (a,b) = \bigcup_{n=1}^{\infty} [a_n,b_n] \in \mathcal{A}_1. \]

So a countable union of open intervals (i.e. intervals of type $(a,b)$) is contained in $\mathcal{A}_2$, and its complement in $\mathcal{A}_3$. So, e.g., complicated sets like the Cantor set⁶ are elements of $\mathcal{A}_4$.

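Using the concept of measurability, it is now possible to obtain the correct notion of a probability measure.

Definition 3.3.2 (Probability space on $\mathbb{R}$). A function $P : \mathcal{B}\to[0,1]$ is called a probability measure if

(i) $P(\Omega) = 1$, $P(\emptyset) = 0$,

(ii) for each sequence $(A_n : n\in\mathbb{N})$ of pairwise disjoint sets in $\mathcal{B}$,

\[ P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = \sum_{n=1}^{\infty} P(A_n). \]

Furthermore, for a subinterval $I$ of $\mathbb{R}$,⁷ let $\mathcal{B}(I) := \{A\cap I : A \in \mathcal{B}(\mathbb{R})\}$. Then a function $P : \mathcal{B}(I)\to[0,1]$ is called a probability measure if (with $\Omega = I$)

(i) $P(\Omega) = 1$, $P(\emptyset) = 0$,

(ii) for each sequence $(A_n : n\in\mathbb{N})$ of pairwise disjoint sets in $\mathcal{B}(I)$,

\[ P\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = \sum_{n=1}^{\infty} P(A_n). \]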

3.3.2 Densities and distribution functions

To each probability measure $P$ on $\mathbb{R}$, we can associate a function $F_P : \mathbb{R}\to[0,1]$, defined by

\[ F_P(x) = P((-\infty,x]). \]

Conversely, for each function $F : \mathbb{R}\to[0,1]$ which is nondecreasing and right-continuous with $\lim_{x\to\infty} F(x) = 1$ and $\lim_{x\to-\infty} F(x) = 0$, let $P_F((a,b]) := F(b)-F(a)$. By the construction of $\mathcal{B}$, it follows that $P_F$ can be extended to all measurable sets such that the resulting function $P_F : \mathcal{B}\to[0,1]$ is a probability measure. Due to this fact, a function $F : \mathbb{R}\to[0,1]$ which is nondecreasing and right-continuous with $\lim_{x\to\infty} F(x) = 1$ and $\lim_{x\to-\infty} F(x) = 0$ is called a distribution function. In particular, we have shown the following.

⁶ The Cantor set is defined as the set

\[ C := \Bigl\{ \sum_{i=1}^{\infty} \frac{a_i}{3^i} : a_i \in \{0,2\} \Bigr\}. \]

Using this representation, it can be shown that $C$ is the complement of the union of the open intervals $(1/3,2/3)$, $(1/9,2/9)$, $(7/9,8/9)$, $(1/27,2/27)$, $(7/27,8/27)$, ..., and $(-\infty,0)$, $(1,\infty)$.

⁷ E.g. $I = [a,b]$, $(a,b)$, $(a,b]$, $[a,b)$, $(a,\infty)$, etc.

Theorem 3.3.3. To each probability measure $P$ there exists an associated distribution function, and vice versa.

Moreover, it is possible to define probability measures via density functions. A function $f : \mathbb{R}\to\mathbb{R}$ is called a density function if $f \geq 0$ and

\[ \int_{-\infty}^{\infty} f(x)\,dx = 1. \]

As can easily be seen,

\[ F(x) := \int_{-\infty}^{x} f(y)\,dy \]

is a distribution function, and in particular,

\[ P((a,b]) := F(b)-F(a) = \int_a^b f(x)\,dx \]

defines a probability measure. Hence, each density gives rise to a probability measure. However, there exist probability measures which do not have a density. The standard example of such a measure is the so-called δ-distribution, which is defined by $\delta(\{0\}) = 1$. This already defines a probability measure, since the only way to extend this definition to all measurable sets is the following:

\[ \delta(A) := \begin{cases} 1 & 0 \in A \\ 0 & 0 \notin A. \end{cases} \]

Note that this measure assigns positive probability only to $0 \in \mathbb{R}$; all other elements of $\mathbb{R}$ have probability 0. Moreover, the distribution function of δ is

\[ F_\delta(x) := \begin{cases} 0 & x < 0 \\ 1 & x \geq 0. \end{cases} \]

However, δ is not given by a density function, since a function $f$ with

\[ \int_0^0 f(x)\,dx = 1 \]

does not exist.

Here, we will exclusively consider probability measures on $\mathbb{R}$ which are defined via density functions. A probability measure defined via a density is also called a continuous probability measure.⁸ For continuous probability measures, a single point has probability 0, which follows from the same argument that shows that δ has no density.

⁸ This name stems from the following facts. The Lebesgue measure $\mu$ is the infinite measure defined by $\mu((a,b]) := b-a$. As in the case of a probability measure, this gives rise to a function $\mu : \mathcal{B}\to[0,\infty]$, where $\mu(\Omega) = \infty$ instead of 1. A probability measure $P$ is called absolutely continuous with respect to $\mu$ if $\mu(A) = 0$ implies that $P(A) = 0$ (for $A \in \mathcal{B}$). As a consequence of the Radon-Nikodym theorem, this is equivalent to the existence of a density function of $P$.

Finally, note that each probability measure on an interval $I$ can be extended to a probability measure on $\mathbb{R}$ by setting $P(I^c) = 0$. The corresponding distribution function on $\mathbb{R}$ is then defined as above. Moreover, if the probability measure is defined via a density $f : I\to\mathbb{R}$, the extended measure on $\mathbb{R}$ has the density $g : \mathbb{R}\to\mathbb{R}$ given by

\[ g(x) := \begin{cases} f(x) & x \in I \\ 0 & x \in I^c. \end{cases} \]

The uniform and exponential distributions below are examples of these extensions. The uniform distribution can be seen either as a probability measure on $[a,b]$ or, equivalently, as its extension to $\mathbb{R}$.

3.3.3 Probability measures on R and generalized functions (ESM 2B)

Assume that $P$ is a probability measure with distribution function $F$. Then, for $A := (a,b]\subset\mathbb{R}$, define

\[ \int 1_A(x)\,dF(x) := P(A) = F(b)-F(a). \]

This definition can be uniquely extended to nonnegative continuous functions, even though the value of $\int f\,dF$ might be equal to $\infty$. By setting $f^+(x) := \max\{f(x),0\}$ and $f^-(x) := \max\{-f(x),0\}$, one obtains

\[ \int f\,dF := \int f^+\,dF - \int f^-\,dF, \]

given that $\int f^+\,dF$ and $\int f^-\,dF$ are finite. This integral is called the Riemann-Stieltjes integral, and defines an element of $C_c^\infty(\mathbb{R})'$.

This gives rise to the following interpretation. Recall that for a continuous density $f$, the distribution function is defined by

\[ F(x) := \int_{-\infty}^{x} f(t)\,dt. \]

In this case, $F$ is differentiable, and $F' = f$ by the fundamental theorem of calculus. With the notion of the derivative of a generalized function at hand, one could define $dF$ as a density, even if it is not a function.

Furthermore, $\int f\,dF \geq 0$ for $f \geq 0$, and $\int 1\,dF = 1$ since $P$ is a probability measure. An element of $C_c^\infty(\mathbb{R})'$ with these properties is called positive and normalized, and by the Daniell-Stone theorem, each positive and normalized element of $C_c^\infty(\mathbb{R})'$ defines a probability measure.

3.3.4 Important continuous probability spaces

The following notation for a given set $A \in \mathcal{B}$ allows one to describe probabilities of events in a short way. With $1_A : \mathbb{R}\to\mathbb{R}$ referring to the function given by

\[ 1_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A, \end{cases} \]

the integral of a function $f$ restricted to the set $A$ might be written as (provided that the integral of $1_A f$ exists)

\[ \int_{-\infty}^{\infty} 1_A(x) f(x)\,dx =: \int_A f(x)\,dx. \]

In particular, if $P$ denotes a probability measure given by a density $f$, the probability of the event $A$ is given by

\[ P(A) = \int_A f(x)\,dx = \int 1_A(x) f(x)\,dx. \]

For example, for $A := [a,b]\cup[c,d]$ ($a,b,c,d \in \mathbb{R}$ and $a < b < c < d$),

\[ P(A) = \int_A f(x)\,dx = \int_a^b f(x)\,dx + \int_c^d f(x)\,dx. \]

Uniform distribution U([a,b])

The uniform distribution on a given interval $[a,b]$ (for $a,b \in \mathbb{R}$ and $a < b$) has the density function

\[ f(x) := \frac{1}{b-a}\,1_{[a,b]}(x). \]

Hence, the distribution function of the uniform distribution is given by

\[ F(x) := \int_{-\infty}^{x} \frac{1}{b-a}\,1_{[a,b]}(t)\,dt = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & a \leq x \leq b \\ 1 & x > b. \end{cases} \]

For future reference, the distribution will be denoted by U([a,b]). For the graph of the density and the distribution function, see Figure 3.4.

Figure 3.4: Density (black) and distribution function (blue) of the uniform distribution

Exponential distribution Exp(λ )

The exponential distribution is usually applied to model waiting times in continuous time. For a given parameter $\lambda > 0$, the distribution is defined as follows.

\[ \Omega = [0,\infty), \quad f(x) = \lambda e^{-\lambda x}. \]

For the exponential distribution, the distribution function can be calculated explicitly: for $x \geq 0$,

\[ F(x) = \int_0^x \lambda e^{-\lambda t}\,dt = -e^{-\lambda t}\Big|_0^x = -e^{-\lambda x}+1 = 1-e^{-\lambda x}. \]

For a graphical representation, see Figure 3.5.

Figure 3.5: Densities and distribution functions for λ = 1 (solid) and λ = 2 (dashed).

Hence, the probability of the event $[a,b]\subset\mathbb{R}_+$ is given by

\[ P([a,b]) = \int_a^b \lambda e^{-\lambda t}\,dt = F(b)-F(a) = e^{-\lambda a}-e^{-\lambda b}. \]

An exponential distribution with parameter λ will be denoted by Exp(λ ).

Example 32 (Radioactive decay, continuous model). Assume we again have a single radioactive atom. We now want to model the waiting time (in continuous time) until the atom decays, given the half-life $K$ (in continuous time). This is modelled using the exponential distribution.

(i) $\Omega = [0,\infty)$.

(ii) As in the discrete model, we have to solve the following identity to obtain the parameter $\lambda$ for given $K$:

\[ \frac{1}{2} = P([0,K]) = F(K) = 1-e^{-\lambda K}. \]

Hence,

\[ e^{-\lambda K} = \frac{1}{2} \;\Rightarrow\; -\lambda K = \log\frac{1}{2} \;\Rightarrow\; \lambda = -\frac{1}{K}\log\frac{1}{2} = \frac{\log 2}{K}. \]
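The defining identity $F(K) = 1/2$ can be checked numerically. The following Python sketch solves it for $\lambda$ and evaluates the distribution function; the function names and the half-life of 100 years are assumptions of this example.

```python
from math import exp, log


def decay_rate(half_life):
    """Solve 1/2 = 1 - exp(-lam * K) for lam, given the half-life K."""
    return log(2.0) / half_life


def exp_cdf(x, lam):
    """Distribution function F(x) = 1 - exp(-lam * x) of Exp(lam), x >= 0."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0


lam = decay_rate(100.0)  # assumed half-life of 100 years
print(lam, exp_cdf(100.0, lam))

# Probability of a decay within the next 5 years, given no decay in the
# first 100 years, compared with the unconditional probability F(5).
conditional = (exp_cdf(105.0, lam) - exp_cdf(100.0, lam)) / (1.0 - exp_cdf(100.0, lam))
print(conditional, exp_cdf(5.0, lam))
```

The last two printed values agree, which reflects the fact that the exponential distribution has no memory.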

Normal distribution N(µ,σ2)

The normal distribution is probably the most important distribution in statistics (this is a consequence of the central limit theorem, see below). For parameters $\mu \in \mathbb{R}$ and $\sigma > 0$, the normal distribution is defined by

\[ \Omega = \mathbb{R}, \quad f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}. \]

For abbreviation, this distribution will be denoted by $N(\mu,\sigma^2)$. For graphs of the density function, see Figure 3.6.

Figure 3.6: Densities of N(0,2) (red), N(0,1) (black), N(0,1/2) (blue), N(1,4) (red, dotted), N(1,1) (black, dotted) and N(1,1/4) (blue, dotted).

The normal distribution with parameters $\mu = 0$ and $\sigma^2 = 1$ (i.e. $N(0,1)$) is called the standard normal distribution. In contrast to the uniform and exponential distributions, it is not possible to express the distribution function in closed form. Therefore, explicit calculations are usually done using tables with numerical values of the distribution function of the standard normal distribution (see e.g. Table 3.1), or a computer. After introducing random variables, we will see how to use these tables to calculate probabilities for $N(\mu,\sigma^2)$ with arbitrary $\mu,\sigma^2$.

x      0.0        0.1        0.2        0.3        0.4
Φ(x)   0.5000000  0.5398278  0.5792597  0.6179114  0.6554217
x      0.5        0.6        0.7        0.8        0.9
Φ(x)   0.6914625  0.7257469  0.7580363  0.7881446  0.8159399
x      1.0        1.1        1.2        1.3        1.4
Φ(x)   0.8413447  0.8643339  0.8849303  0.9031995  0.9192433
x      1.5        1.6        1.7        1.8        1.9
Φ(x)   0.9331928  0.9452007  0.9554345  0.9640697  0.9712834
x      2.0        2.1        2.2        2.3        2.4
Φ(x)   0.9772499  0.9821356  0.9860966  0.9892759  0.9918025

Table 3.1: Distribution function of the standard normal distribution
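On a computer, the values of the table can be obtained from the error function, using the identity $\Phi(x) = \frac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$. The following Python sketch (the function name `Phi` is chosen for this example) reproduces the entries of Table 3.1.

```python
from math import erf, sqrt


def Phi(x):
    """Distribution function of the standard normal distribution N(0,1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


# Reproduce the table for x = 0.0, 0.1, ..., 2.4.
for i in range(25):
    x = i / 10
    print(f"{x:.1f}  {Phi(x):.7f}")
```

Values for negative arguments follow from the symmetry $\Phi(-x) = 1 - \Phi(x)$.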

3.4 Random variables

Random variables are a very useful concept to describe sums, products or expected values of random objects. The definition is the following. Assume that $(\Omega,P)$ is an abstract probability space, and that

\[ X : \Omega\to\mathbb{R} \]

is a map.⁹ Then $X$ is called a random variable. Furthermore, by defining

\[ P_X((a,b]) := P(\{\omega\in\Omega : X(\omega) \in (a,b]\}) \]

for an interval $(a,b] \subset \mathbb{R}$, it follows that each random variable has its own distribution $P_X$. Events in which random variables are involved are written as

\[ [a < X \leq b] := \{\omega\in\Omega : X(\omega) \in (a,b]\}, \quad\text{or}\quad [f(X)\leq b] := \{\omega\in\Omega : f(X(\omega))\leq b\}, \]

where $f : \mathbb{R}\to\mathbb{R}$ is some (measurable) function. The corresponding probabilities are then written as

\[ P(a < X \leq b), \quad P(f(X)\leq b). \]

Note that most of the probability measures introduced here give rise to random variables by considering the identity map. The corresponding random variables are then called binomially distributed, Poisson distributed, normally distributed, ... random variables.

As a first useful application, let $X_1,X_2,\dots,X_n$ be Bernoulli-distributed random variables with parameter $p\in(0,1)$. That is, $X_i : \{0,1\}\to\{0,1\}$ for $i = 1,\dots,n$. Provided that $X_1,X_2,\dots,X_n$ are independent (see below for the definition), we then obtain that

\[ X_1 + X_2 + X_3 + \dots + X_n \]

is again a random variable, which is $B(n,p)$ distributed.

⁹ To be precise: $(\Omega,\mathcal{A},P)$ is a probability space, where $\mathcal{A}$ is a σ-algebra, and $X$ has to be measurable, that is, $X^{-1}([a,b]) \in \mathcal{A}$ for all intervals $[a,b]\subset\mathbb{R}$.

Transformation rules

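Let $X$ be a random variable, $\lambda \in \mathbb{R}\setminus\{0\}$, $a \in \mathbb{R}$, $A \in \mathcal{B}$ and $\psi : \mathbb{R}\to\mathbb{R}$ a measurable map. With $\lambda^{-1}A := \{x/\lambda : x \in A\}$ and $A-a := \{x-a : x\in A\}$, we then have for the following events that

\[ [\lambda X \in A] = [X \in \lambda^{-1}A], \quad [X+a \in A] = [X \in A-a], \quad [\psi(X) \in A] = [X \in \psi^{-1}(A)]. \]

In particular, it follows that

\[ P(\lambda X \in A) = P(X \in \lambda^{-1}A), \quad P(X+a \in A) = P(X \in A-a), \quad P(\psi(X)\in A) = P(X \in \psi^{-1}(A)). \]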

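Example 33. For an $N(\mu,\sigma^2)$-distributed random variable $X$ we obtain, using the above rules and the substitution $y = (x-\mu)/\sigma$ in the integral, that

\[
P\Bigl(\frac{X-\mu}{\sigma} \leq a\Bigr) = P(X \leq a\sigma+\mu)
= \int_{-\infty}^{a\sigma+\mu} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\,dx
= \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}\,dy.
\]

In particular, it follows that $\frac{X-\mu}{\sigma}$ is a standard normally distributed random variable.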

3.4.1 Sums of independent random variables

For later use, the notion of independent random variables is of great importance.

Definition 3.4.1. Let $X,Y$ be random variables which are defined on the same probability space.¹⁰ Similar to the notion of independent events, $X,Y$ are called independent if

\[ P(X \in A, Y \in B) = P(X \in A)P(Y \in B) \]

for all $A,B \in \mathcal{B}$. Furthermore, the random variables $X_1,X_2,\dots,X_n$ are called independent if

\[ P(X_1 \in A_1,\dots,X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i) \]

for all $A_1,\dots,A_n \in \mathcal{B}$. Finally, a sequence of random variables $(X_i : i \in \mathbb{N})$ is called independent if each finite subset of this sequence consists of independent random variables.

In particular, the sum of independent random variables is of interest. For two independent, continuous random variables $X,Y$ with densities $f_X,f_Y$, the density of $X+Y$ can be determined as follows.

\[
P(X+Y \leq \alpha) = P((X,Y) \in \{(x,y) : x+y\leq\alpha\}) = \int_{\{(x,y):\ x+y\leq\alpha\}} f_X(x) f_Y(y)\,dx\,dy.
\]

¹⁰ This means that $X,Y$ are both maps from the same probability space to $\mathbb{R}$.

Using the substitution $x = u-y$, we obtain

\[
P(X+Y \leq \alpha) = \int_{\{(x,y):\ x+y\leq\alpha\}} f_X(x) f_Y(y)\,dx\,dy
= \int_{\{(u,y):\ u\leq\alpha,\ y\in\mathbb{R}\}} f_X(u-y) f_Y(y)\,du\,dy
= \int_{-\infty}^{\alpha}\int_{-\infty}^{\infty} f_X(u-y) f_Y(y)\,dy\,du.
\]

Hence the density of $X+Y$ is given by

\[ u \mapsto \int_{-\infty}^{\infty} f_X(u-y) f_Y(y)\,dy. \]

Note that this function is also called the convolution of $f_X$ and $f_Y$.

For the discrete case, the same argument gives, for the sum of two independent, discrete random variables $X,Y$ with values in $\mathbb{N}$, that

\[ P(X+Y = n) = \sum_{k+l=n} P(X = k)P(Y = l). \]
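The discrete convolution formula translates directly into a short program. In the following Python sketch, a distribution on $\{0,1,\dots\}$ with finite support is represented by a list of probabilities; this representation and the function names are assumptions of the example. As a check, the convolution of two binomial distributions with the same parameter $p$ is compared with the binomial distribution of the sum.

```python
from math import comb


def convolve(pX, pY):
    """Distribution of X + Y for independent X, Y.

    pX[k] = P(X = k) and pY[l] = P(Y = l); the result r satisfies
    r[n] = sum over k + l = n of P(X = k) * P(Y = l).
    """
    result = [0.0] * (len(pX) + len(pY) - 1)
    for k, a in enumerate(pX):
        for l, b in enumerate(pY):
            result[k + l] += a * b
    return result


def binomial(n, p):
    """List of the probabilities P({0}), ..., P({n}) of B(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


total = convolve(binomial(3, 0.3), binomial(4, 0.3))
print(total)
print(binomial(7, 0.3))  # agrees with the line above up to rounding
```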

As an immediate application, we obtain the following distributions of sums of independent random variables. Note that there are no comparable results for the uniform or exponential distribution, since the sum of two uniform random variables is not a uniform random variable, and the sum of two exponentially distributed random variables is not an exponentially distributed random variable.¹¹

Proposition 3.4.2. Assume that $X,Y$ are independent random variables. We then have the following for the distribution of $X+Y$.

Distr. of X              Distr. of Y              Distr. of X + Y
$N(\mu_1,\sigma_1^2)$    $N(\mu_2,\sigma_2^2)$    $N(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)$
$B(n,p)$                 $B(m,p)$                 $B(n+m,p)$
$\mathrm{Pois}(\lambda)$ $\mathrm{Pois}(\lambda)$ $\mathrm{Pois}(2\lambda)$

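Proof. Here, we will only prove the third case:

\[
\begin{aligned}
P(X+Y = n) &= \sum_{k+l=n} e^{-\lambda}\frac{\lambda^k}{k!}\, e^{-\lambda}\frac{\lambda^l}{l!}
= \sum_{k=0}^{n} e^{-\lambda}\frac{\lambda^k}{k!}\, e^{-\lambda}\frac{\lambda^{n-k}}{(n-k)!}
= \sum_{k=0}^{n} e^{-2\lambda}\frac{\lambda^n}{k!(n-k)!} \\
&= \sum_{k=0}^{n} e^{-2\lambda}\frac{\lambda^n}{n!}\binom{n}{k}
= e^{-2\lambda}\frac{\lambda^n}{n!}\sum_{k=0}^{n}\binom{n}{k}
= e^{-2\lambda}\frac{\lambda^n}{n!}(1+1)^n
= e^{-2\lambda}\frac{(2\lambda)^n}{n!}.
\end{aligned}
\]

¹¹ However, if the exponential distribution is seen as an element of the family of Γ-distributions, then the sum stays within this class.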

3.5 Expectation and variance

We are now in a position to introduce the notion of expectation, which then allows us to define expected values and variances.

Definition 3.5.1. Let $Y$ be a non-negative random variable. If $Y$ is a discrete random variable with values in the countable set $J$, then

\[ E(Y) := \sum_{k\in J} k\,P(Y = k). \]

Furthermore, if $Y$ is a continuous random variable with density $f$, then

\[ E(Y) := \int y\,f(y)\,dy. \]

Note that in the above definition, $E(Y)$ is always defined, but might be infinite. However, it is a standard fact from integration theory that $E(|X|) < \infty$ implies that $E(X)$ is well defined and finite.

Definition 3.5.2 (Expectation). Let $X$ be a random variable with $E(|X|) < \infty$, and $E(X)$ defined as above. Then $E(X)$ is called the expected value or expectation of $X$. Furthermore, for $p \in \mathbb{N}$, $E(|X|^p)$ is called the $p$-th moment of $X$, and

\[ \mathrm{Var}(X) := E((X-E(X))^2) \]

is called the variance of $X$.

For expectations, the following rules apply.

Proposition 3.5.3. Let $X,Y$ be random variables with $E(|X|),E(|Y|) < \infty$. Then

(i) $E(\lambda X) = \lambda E(X)$ for all $\lambda \in \mathbb{R}$.

(ii) $E(X+Y) = E(X)+E(Y)$.

(iii) If $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$.

(iv) If $X$ and $Y$ are independent, then $\mathrm{Var}(X+Y) = \mathrm{Var}(X)+\mathrm{Var}(Y)$.

(v) If $E(X^2) < \infty$, then $E(|X|) < \infty$.

(vi) If $\mathrm{Var}(X) < \infty$, then $\mathrm{Var}(X) = E(X^2)-(E(X))^2$.

Example 34 (Expectation and variance of the binomial distribution). For a Bernoulli-distributed random variable $X$ with parameter $p \in (0,1)$ we have

\[ E(X) = \sum_{k=0}^{1} k\,P(X = k) = p, \quad\text{and}\quad E(X^2) = \sum_{k=0}^{1} k^2 P(X = k) = p. \]

Hence $\mathrm{Var}(X) = p-p^2 = p(1-p)$. For a $B(n,p)$-distributed random variable $X$, the expectation and variance can be determined using the above proposition and the fact that $X$ is the sum of $n$ independent Bernoulli random variables. Hence,

\[ E(X) = np, \quad \mathrm{Var}(X) = np(1-p). \]

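Example 35 (Expectation and variance of the Poisson distribution). Let $X$ be a random variable with $X \sim \mathrm{Pois}(\lambda)$. Then

\[
E(X) = \sum_{k=0}^{\infty} e^{-\lambda}\, k\,\frac{\lambda^k}{k!}
= \lambda e^{-\lambda}\sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!}
= \lambda e^{-\lambda}\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda,
\]

\[
E(X^2) = \sum_{k=0}^{\infty} e^{-\lambda}\, k^2\,\frac{\lambda^k}{k!}
= \sum_{k=0}^{\infty} e^{-\lambda}\, k(k-1)\frac{\lambda^k}{k!} + \sum_{k=0}^{\infty} e^{-\lambda}\, k\,\frac{\lambda^k}{k!}
= \lambda^2+\lambda.
\]

Hence $\mathrm{Var}(X) = E(X^2)-(E(X))^2 = \lambda$, that is, expectation and variance are both equal to $\lambda$.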
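The two series can also be evaluated numerically. The following Python sketch approximates the expectation and the variance of a Poisson distribution by truncating the sums; the truncation point `kmax = 100`, the choice $\lambda = 2$ and the function name are assumptions of this example.

```python
from math import exp


def poisson_probabilities(lam, kmax):
    """Return [P({0}), ..., P({kmax})] for Pois(lam), computed iteratively
    via P({k}) = P({k-1}) * lam / k."""
    probs = [exp(-lam)]
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * lam / k)
    return probs


lam = 2.0
probs = poisson_probabilities(lam, kmax=100)  # the neglected tail is tiny

mean = sum(k * p for k, p in enumerate(probs))
second_moment = sum(k * k * p for k, p in enumerate(probs))
variance = second_moment - mean**2

print(mean, second_moment, variance)  # close to lam, lam^2 + lam, lam
```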

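Example 36 (Expectation and variance of the exponential distribution). Let $X$ be a random variable with $X \sim \mathrm{Exp}(\lambda)$. Then, using integration by parts,

\[
E(X) = \int_0^{\infty} x\lambda e^{-\lambda x}\,dx = -xe^{-\lambda x}\Big|_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 0+\frac{1}{\lambda},
\]

\[
E(X^2) = \int_0^{\infty} x^2\lambda e^{-\lambda x}\,dx = -x^2e^{-\lambda x}\Big|_0^{\infty} + \int_0^{\infty} 2xe^{-\lambda x}\,dx = 0+\frac{2}{\lambda}\int_0^{\infty} x\lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2}.
\]

Hence, the expectation is equal to $1/\lambda$, and the variance is equal to $2/\lambda^2 - 1/\lambda^2 = \lambda^{-2}$.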

Example 37 (Expectation and variance of the uniform distribution). Let $X$ be a random variable with $X \sim U([a,b])$. Then

\[ E(X) = \frac{a+b}{2}, \quad \mathrm{Var}(X) = \frac{(b-a)^2}{12}. \]

Example 38 (Expectation and variance of the normal distribution). Let $X$ be a random variable with $X \sim N(\mu,\sigma^2)$. Then

\[ E(X) = \mu, \quad \mathrm{Var}(X) = \sigma^2. \]

Applications

Example 39 (An application to overbooking by airlines). Finally, we give a ‘real life’ application of the theory developed so far. Assume that an airline possesses a plane with 400 seats. It is known from the past that approximately 2% of the passengers do not check in in time. In order to use the plane more efficiently, one is tempted to overbook the flight in such a way that only in rare cases a passenger does not get a seat in the plane.

Model 1. In order to model the number of passengers who are late, we use the Poisson distribution. Given that $n$ people bought a ticket, let $X_n$ be the $\mathrm{Pois}(\lambda_n)$-distributed random variable modelling the number of persons who are late. So we have to choose $\lambda_n$ according to

\[ E(X_n) = \lambda_n = n \cdot 0.02. \]

We are now able to answer the following question: how many tickets should be sold such that, with a probability of 95%, each passenger obtains a seat? In our model, this is equivalent to finding $n$ such that

\[ P(X_n \leq n-400 \mid n \text{ tickets sold}) < 0.05. \]

For each $n$, we hence obtain the following.

(i) For $n = 401$,

\[
P(X_n \leq n-400 \mid n \text{ tickets sold}) = P(X_n = 0 \mid n \text{ tickets sold})+P(X_n = 1 \mid n \text{ tickets sold})
= e^{-\lambda_n}\Bigl(\frac{\lambda_n^0}{0!}+\frac{\lambda_n^1}{1!}\Bigr) = e^{-\lambda_n}(1+\lambda_n).
\]

(ii) For $n = 402$,

\[
P(X_n \leq n-400 \mid n \text{ tickets sold}) = e^{-\lambda_n}\Bigl(\frac{\lambda_n^0}{0!}+\frac{\lambda_n^1}{1!}+\frac{\lambda_n^2}{2!}\Bigr) = e^{-\lambda_n}\bigl(1+\lambda_n+\lambda_n^2/2\bigr).
\]

(iii) For $n = 400+k$,

\[ P(X_n \leq k \mid n \text{ tickets sold}) = e^{-\lambda_n}\sum_{i=0}^{k} \frac{\lambda_n^i}{i!}. \]

This then gives rise to the following table.

k     P(X_n ≤ k | n tickets sold)
1     0.002965957
2     0.01333096
3     0.04069440
4     0.0951429
5     0.1822465
6     0.2989386
7     0.4335955
8     0.5702253
9     0.6940543
10    0.79555
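The entries of this table can be reproduced with a few lines of code. The following Python sketch evaluates the formula from (iii) with $\lambda_n = 0.02\cdot(400+k)$; the function names and default arguments are assumptions of this example.

```python
from math import exp


def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Pois(lam), summing the terms e^(-lam) lam^i / i!."""
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def model1_probability(k, seats=400, late_rate=0.02):
    """P(X_n <= k | n tickets sold) in Model 1 with n = seats + k."""
    n = seats + k
    return poisson_cdf(late_rate * n, k)


for k in range(1, 11):
    print(k, model1_probability(k))
```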


Model 2. Alternatively, it is possible to model this problem using a binomial random variable. In this case, $X_n \sim B(n,p)$, where $p = 0.02$. By the same arguments as above, we have to find $n = 400+k$ such that the probability

\[ P(X_n \leq k \mid n \text{ tickets sold}) = \sum_{i=0}^{k} \binom{n}{i} p^i(1-p)^{n-i} \]

is smaller than 0.05. However, to determine these probabilities, one could also apply the central limit theorem (or, to be precise, the de Moivre-Laplace theorem). To this end, we consider

\[ Z_n := \frac{X_n-np}{\sqrt{np(1-p)}} \]

and use the fact that $Z_n$ is approximately $N(0,1)$ distributed. Using the transformation rule, we have

\[
P(X_n \leq k \mid n \text{ tickets sold}) = P\Bigl(\frac{X_n-np}{\sqrt{np(1-p)}} \leq \frac{k-np}{\sqrt{np(1-p)}} \Bigm| n \text{ tickets sold}\Bigr).
\]

In order to obtain a more precise result, one considers the same expression with a continuity correction:

\[
P(X_n \leq k \mid n \text{ tickets sold}) = P\Bigl(\frac{X_n-np}{\sqrt{np(1-p)}} \leq \frac{k+\frac{1}{2}-np}{\sqrt{np(1-p)}} \Bigm| n \text{ tickets sold}\Bigr).
\]

This then gives rise to the following table, in which the probabilities are computed using the standard normal distribution.

k     (k+1/2−np)/√(np(1−p))    P(X_n ≤ k | n tickets sold)
1     -2.325666                 0.01001818
2     -1.973643                 0.02421115
3     -1.622498                 0.05234834
4     -1.272226                 0.1016464
5     -0.9228217                0.1780501
6     -0.5742804                0.2828890
7     -0.2265973                0.4103684
8     0.1202322                 0.5478504
9     0.4662129                 0.6794684
10    0.8113493                 0.7914175
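The two columns of this table can be recomputed as follows. The Python sketch below evaluates the continuity-corrected argument and the corresponding value of the standard normal distribution function, which is expressed via the error function; the function names and default arguments are assumptions of this example.

```python
from math import erf, sqrt


def Phi(x):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def model2_approximation(k, seats=400, p=0.02):
    """Return (z, Phi(z)) with z = (k + 1/2 - n p) / sqrt(n p (1 - p))
    and n = seats + k, i.e. the normal approximation with continuity
    correction of P(X_n <= k) for X_n ~ B(n, p)."""
    n = seats + k
    z = (k + 0.5 - n * p) / sqrt(n * p * (1.0 - p))
    return z, Phi(z)


for k in range(1, 11):
    z, prob = model2_approximation(k)
    print(k, z, prob)
```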

Example 40. Assume one arrives at a random time at a bus stop from which a bus leaves every 10 minutes (since one is not able to remember the departure times). Then the waiting time until the next bus arrives is modelled by a uniform distribution on $[0,10]$. In particular, the expected waiting time is

\[ \frac{1}{10}\int_0^{10} x\,dx = \frac{1}{10}\cdot\frac{10^2}{2} = 5. \]

Example 41. Assume Peter leaves his office every day at a random time, goes to the bus stop, and waits for the next bus. At the bus stop, there are two bus lines A and B, where line A stops at a bus stop near the apartment of Peter’s girlfriend, and line B stops at a bus stop near the house of Peter’s mother. So, if the next bus is from line A, then Peter visits his girlfriend after work, and otherwise his mother.

After some time, Peter notices that he visits his girlfriend 9 times more often than his mother. How is that possible if both lines provide a bus every 10 minutes?

Assume that the buses of line B arrive $k$ minutes after the ones of line A, and denote by $T \in [0,10]$ the time when Peter arrives at the bus stop. Then Peter visits his girlfriend if $1_{[k,10]}(T) = 1$, and his mother if $1_{[0,k]}(T) = 1$. Hence, the probability that Peter visits his girlfriend on a given day (that is, the expected number of visits per day) is equal to

\[ E(1_{[k,10]}(T)) = \frac{1}{10}\int_0^{10} 1_{[k,10]}(x)\,dx = \frac{1}{10}\int_k^{10} dx = \frac{10-k}{10}. \]

By the same argument, one obtains that the expected number of visits per day to his mother is equal to

\[ E(1_{[0,k]}(T)) = \frac{1}{10}\int_0^{10} 1_{[0,k]}(x)\,dx = \frac{1}{10}\int_0^{k} dx = \frac{k}{10}. \]

Hence, Peter’s observation is explained if $(10-k)/k = 9$, that is, if $k = 1$.

3.6 Limit theorems for sums of independent random variables

In the sequel, we will consider finite or infinite sequences of independent, identically distributed random variables $X_1,X_2,\dots$, that is, independent random variables which all have the same distribution. ‘Independent, identically distributed’ will be abbreviated by i.i.d. An example of i.i.d. random variables is given by tossing a coin $n$ times. Note that for i.i.d. random variables $(X_i)$, we clearly have that $E(X_1) = E(X_2) = \cdots$ and $\mathrm{Var}(X_1) = \mathrm{Var}(X_2) = \cdots$.

3.6.1 The law of large numbers

The first limit theorem asserts that the mean value of i.i.d. random variables converges to their expected value, which is a quantity that is no longer random.

Theorem 3.6.1 ((Strong) law of large numbers). Let $(X_n : n \in \mathbb{N})$ be a sequence of i.i.d. random variables such that $E(|X_1|) < \infty$. Then

\[ \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} X_i = E(X_1) \]

with probability 1.

In some sense, the law of large numbers connects the intuition one has about independent repetitions of an experiment with random outcome to the probability theory developed so far. This can be seen from the following example. For $n \in \mathbb{N}$, let $X_n$ refer to the outcome of rolling a die the $n$-th time. Furthermore, let

$$Y_n := \begin{cases} 1 & X_n = 6 \\ 0 & X_n \in \{1,2,3,4,5\}. \end{cases}$$

Hence, by the law of large numbers, it follows that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} Y_i = E(Y_1) = 0\cdot P(Y_1 = 0) + 1\cdot P(Y_1 = 1) = P(X_1 = 6) = \frac{1}{6}$$

with probability 1. Since $\sum_{i=1}^{n} Y_i$ is the number of times up to time $n$ at which the die showed 6, we obtain a precise version of the intuitive (and imprecise) statement that the number 6 occurs on average in 1/6 of the rolls. This simple idea will turn out to be useful in estimation theory in statistics.
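The die example can be illustrated numerically. The sketch below (function name, sample sizes and seed are assumptions made for illustration) computes the relative frequency $\frac{1}{n}\sum_{i=1}^n Y_i$ of the outcome 6 for increasing $n$.

```python
import random


def relative_frequency_of_six(n, seed=1):
    """Roll a fair die n times and return the fraction of rolls showing 6."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.randint(1, 6) == 6)
    return hits / n


for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_six(n))
```

For large $n$ the printed values should settle near $1/6 \approx 0.1667$, while for small $n$ they may still deviate noticeably.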

3.6.2 The central limit theorem

From the viewpoint of the law of large numbers, the central limit theorem might be seen as a statement for the case of i.i.d. random variables with $E(X_1) = 0$ and $\mathrm{Var}(X_1) = 1$, where the normalization $\frac{1}{n}$ is replaced by $\frac{1}{\sqrt{n}}$. It turns out that with respect to this normalization, one obtains convergence to a distribution instead of convergence to a point.

Theorem 3.6.2 (The central limit theorem). Let $(X_n : n \in \mathbb{N})$ be a sequence of i.i.d. random variables such that $E(X_1^2) < \infty$, and let $\mu := E(X_1)$, $\sigma := \sqrt{\mathrm{Var}(X_1)}$. Then

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{X_i - \mu}{\sigma} \xrightarrow{\;*\;} N(0,1).$$

Remarks.

(i) This theorem clarifies the significance of the normal distribution.

(ii) The precise definition of the convergence $*$ is the following. Let $Y$ be a standard normally distributed random variable. Then

$$\lim_{n\to\infty} E\left(f\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \frac{X_i - \mu}{\sigma}\right)\right) = E(f(Y))$$

for each continuous function $f : \mathbb{R} \to \mathbb{R}$ with bounded support (that is, there exist $a, b$ depending on $f$ such that $f(x) = 0$ for all $x \notin [a,b]$). A convergence of this type is called weak convergence.

(iii) An important application of the central limit theorem is to obtain approximate values for distributions of sums of independent variables. This applies in particular to the binomial distribution, which can be seen as the distribution of a sum of Bernoulli distributed random variables.

(iv) A further important application is the construction of asymptotic confidence intervals or asymptotic hypothesis tests.

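Example 42. We now give an application of the central limit theorem for finding approximate probabilities. So let $X_1, \ldots, X_{36}$ be i.i.d. random variables which serve as a model for rolling a fair die 36 times. Furthermore, let

$$A := \{120, 121, \ldots, 138\}.$$

Using the central limit theorem, one is now in a position to determine $P(X_1 + X_2 + \cdots + X_{36} \in A)$ approximately. The first step is to determine the expectation and variance of $X_1$:

$$E(X_1) = \frac{7}{2}, \quad E(X_1^2) = \frac{91}{6}, \quad \mathrm{Var}(X_1) = E(X_1^2) - (E(X_1))^2 = \frac{35}{12}.$$

Using the central limit theorem it follows that, with $\Phi$ referring to the distribution function of the standard normal distribution and $n = 36$,

$$\begin{aligned}
P(A) &= P(120 \le X_1 + X_2 + \cdots + X_{36} \le 138) = P\left(120 \le \sum_{i=1}^{36} X_i \le 138\right) \\
&= P\left(120 - nE(X_1) \le \sum_{i=1}^{36} (X_i - E(X_1)) \le 138 - nE(X_1)\right) \\
&= P\left(\frac{120 - nE(X_1)}{\sqrt{36\,\mathrm{Var}(X_1)}} \le \frac{\sum_{i=1}^{36} (X_i - E(X_1))}{\sqrt{36\,\mathrm{Var}(X_1)}} \le \frac{138 - nE(X_1)}{\sqrt{36\,\mathrm{Var}(X_1)}}\right) \\
&= P\left(\frac{120 - 126}{6\sqrt{35/12}} \le \frac{1}{\sqrt{n}}\sum_{i=1}^{36} \frac{X_i - E(X_1)}{\sqrt{\mathrm{Var}(X_1)}} \le \frac{138 - 126}{6\sqrt{35/12}}\right) \\
&= P\Bigg(-0.585 \le \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^{36} \frac{X_i - E(X_1)}{\sqrt{\mathrm{Var}(X_1)}}}_{\approx\, N(0,1)} \le 1.17\Bigg) \\
&\approx \Phi(1.17) - \Phi(-0.585) = \Phi(1.17) - (1 - \Phi(0.585)) \approx 0.600.
\end{aligned}$$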
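The steps of Example 42 can be retraced numerically. The following sketch uses only the Python standard library; the variable names are illustrative. It computes $E(X_1)$ and $\mathrm{Var}(X_1)$ exactly with fractions and then evaluates the normal approximation $\Phi(\text{upper}) - \Phi(\text{lower})$ without continuity correction.

```python
from fractions import Fraction
from math import sqrt
from statistics import NormalDist

faces = range(1, 7)
mean = Fraction(sum(faces), 6)                          # E(X1) = 7/2
second_moment = Fraction(sum(f * f for f in faces), 6)  # E(X1^2) = 91/6
variance = second_moment - mean ** 2                    # Var(X1) = 35/12

n = 36
sd_sum = sqrt(float(n * variance))                      # standard deviation of the sum

# Standardised bounds of the event 120 <= X1 + ... + X36 <= 138.
lower = (120 - n * float(mean)) / sd_sum
upper = (138 - n * float(mean)) / sd_sum

phi = NormalDist().cdf                                  # standard normal distribution function
approx = phi(upper) - phi(lower)
print(mean, variance, round(lower, 3), round(upper, 3), round(approx, 3))
```

Using exact fractions for the moments avoids rounding errors before the final evaluation of $\Phi$.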

Continuity correction. If a discrete random variable is approximated by a continuous one, then a so-called continuity correction provides better approximations.

The reason for that is the following. Assume that in the above example, one would like to approximate $P(\sum X_i = 120)$. Then the method above gives the value 0 as approximation. But if one considers

$$P\left(120 - \tfrac{1}{2} \le \sum X_i \le 120 + \tfrac{1}{2}\right),$$

then the approximation is positive. Note, however, that an approximation of this kind is in general not very accurate.¹² By considering approximations for events like $A$ ('intervals'), these inaccuracies no longer matter, since the inaccuracies in the single points sum up to an accurate approximation.

In the above calculation this is applied as follows. By replacing 120 with $120 - \frac{1}{2}$ and 138 with $138 + \frac{1}{2}$, one obtains $P(A) \approx 0.626$, which is a good approximation of the true probability 0.625.

¹² The accuracy depends on the derivative of $\Phi$ at that point. In fact, better approximations can be obtained by implementing the derivative in the continuity correction. Just for demonstration, the following probabilities are approximated using the continuity correction:

$$P\left(\sum X_i = 120\right) \approx 0.0328, \quad P\left(\sum X_i = 90\right) \approx 8.17\cdot 10^{-5}.$$

Their true values are

$$P\left(\sum X_i = 120\right) = 0.0327, \quad P\left(\sum X_i = 90\right) = 6.8\cdot 10^{-5}.$$
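The effect of the continuity correction can be compared with the exact distribution of the sum of 36 dice. The sketch below is an illustration only; the helper `dice_sum_pmf` is a hypothetical name, and the exact distribution is obtained by repeated convolution in floating point.

```python
from math import sqrt
from statistics import NormalDist


def dice_sum_pmf(n):
    """Exact distribution of the sum of n fair dice, by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, p in pmf.items():
            for face in range(1, 7):
                nxt[s + face] = nxt.get(s + face, 0.0) + p / 6
        pmf = nxt
    return pmf


n = 36
mu = n * 3.5                      # expectation of the sum
sd = sqrt(n * 35 / 12)            # standard deviation of the sum
phi = NormalDist(mu, sd).cdf      # approximating normal distribution function
pmf = dice_sum_pmf(n)

exact_interval = sum(pmf[s] for s in range(120, 139))
corrected_interval = phi(138.5) - phi(119.5)   # bounds widened by 1/2 on each side
exact_point = pmf[120]
corrected_point = phi(120.5) - phi(119.5)

print(round(exact_interval, 4), round(corrected_interval, 4))
print(round(exact_point, 4), round(corrected_point, 4))
```

The same pattern applies to any single value $s$: replace the point $\{s\}$ by the interval $[s - \frac12, s + \frac12]$ before applying the normal approximation.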


Chapter 4

Statistics (ESM 2A)

The aim of statistics is to give an answer to a given question (like 'Is there a difference between treatment A and B?', 'How big is the deviation in quality in production?', etc.) by designing and evaluating an experiment.

So assume that one has observed the data $x_1, \ldots, x_n$ in an experiment. This is called a sample. In order to perform a statistical analysis, the first step is to model this outcome as a sequence $X_1, \ldots, X_n$ of random variables from an unknown distribution. For ease of notation, this sequence is also called a sample. Furthermore, in most cases one tries to design the experiment such that one may assume that the random variables are i.i.d., and additionally assumes that the distribution is an element of a known class of distributions, e.g. a normal distribution with unknown $\mu$ and $\sigma$, or a Poisson distribution with unknown parameter $\lambda$.

The next step is to transform the data such that it is accessible to the analysis. Here, this is done via a map

$$T : \mathbb{R}^n \to \mathbb{R},$$

which then gives rise to a random variable $T(X_1, \ldots, X_n)$, which is called a statistic. In order to proceed, one needs information about the distribution of $T(X_1, \ldots, X_n)$. If $T(X_1, \ldots, X_n)$ tends to a (non-random) element of $\mathbb{R}$ as $n \to \infty$, then $T(X_1, \ldots, X_n)$ might be used as an estimator for a parameter of the known class of distributions (e.g. the arithmetic mean tends to the expected value by the law of large numbers). On the other hand, if the distribution of $T(X_1, \ldots, X_n)$ is known, one may use it for the construction of a confidence interval or a hypothesis test.

4.1 Estimators

4.1.1 Estimators for expectation and variance

Using the law of large numbers, one immediately obtains the following estimators for the expected value and the variance. So assume that $X_1, \ldots, X_n$ is an i.i.d. sample from a distribution with existing expected value. Since $X_1, \ldots, X_n$ are i.i.d., it follows that $E(|X_1|) = \cdots = E(|X_n|)$. Hence the assumption that the expectation exists is equivalent to $E(|X_1|) < \infty$.

(i) Let $X_1, \ldots, X_n$ be an i.i.d. sample with $E(|X_1|) < \infty$. Then the statistic

$$\overline{X}_n \equiv \overline{X} := \frac{1}{n}\sum_{i=1}^{n} X_i \tag{4.1}$$

is called the (arithmetic) mean. By the law of large numbers,

$$\lim_{n\to\infty} \overline{X}_n = E(X_1)$$

with probability 1. Hence the arithmetic mean is an estimator for the expected value.

(ii) Let $X_1, \ldots, X_n$ be an i.i.d. sample with $E(X_1^2) < \infty$, and assume that $\mu := E(X_1)$ is known. Then

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2 = \mathrm{Var}(X_1). \tag{4.2}$$

Hence, $\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2$ is an estimator for the variance.

Proof. By the law of large numbers, the above limit is equal to

$$E((X_1 - \mu)^2) = \mathrm{Var}(X_1).$$

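(iii) Let $X_1, \ldots, X_n$ be an i.i.d. sample with $E(X_1^2) < \infty$, assume that $E(X_1)$ is unknown, and let

$$S_n^2 := \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X}_n)^2.$$

Since

$$\lim_{n\to\infty} S_n^2 = \mathrm{Var}(X_1), \tag{4.3}$$

this is an estimator for the variance in this case.

Proof. For the estimator $\frac{n-1}{n} S_n^2$, we have

$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} (X_i - \overline{X}_n)^2
&= \frac{1}{n}\sum_{i=1}^{n} \left( X_i^2 - 2X_i\left(\frac{1}{n}\sum_{j=1}^{n} X_j\right) + \left(\frac{1}{n}\sum_{j=1}^{n} X_j\right)^2 \right) \\
&= \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \frac{2}{n^2}\left(\sum_{i=1}^{n} X_i\right)^2 + \frac{1}{n^2}\left(\sum_{i=1}^{n} X_i\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \frac{1}{n^2}\left(\sum_{i=1}^{n} X_i\right)^2
= \frac{1}{n}\sum_{i=1}^{n} X_i^2 - (\overline{X}_n)^2.
\end{aligned}$$

Hence, by the law of large numbers, $\lim \frac{n-1}{n} S_n^2 = E(X_1^2) - (E(X_1))^2 = \mathrm{Var}(X_1)$. In particular, since $\lim \frac{n-1}{n} = 1$, the assertion follows.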


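Remarks. An estimator for a parameter which converges to that parameter is called consistent. Hence by Equations (4.1), (4.2) and (4.3), the above estimators are all consistent.¹ A further characterisation of an estimator is the following. Let $T$ be an estimator for the parameter $\theta$. Then the bias is defined as $E(T) - \theta$, and the estimator is called unbiased if the bias is equal to 0, or equivalently,

$$E(T) = \theta.$$

For the first and second estimator above, it is easy to see that they are unbiased. For the third, using the above calculation, one obtains

$$\begin{aligned}
E\left(\sum_{i=1}^{n} (X_i - \overline{X}_n)^2\right)
&= E\left(\sum_{i=1}^{n} X_i^2\right) - \frac{1}{n} E\left(\left(\sum_{i=1}^{n} X_i\right)^2\right) \\
&= nE(X_1^2) - \frac{1}{n}\sum_{i,j=1}^{n} E(X_i X_j) \\
&= nE(X_1^2) - \frac{1}{n}\left(\sum_{i=1}^{n} E(X_i^2) + \sum_{i \ne j} E(X_i X_j)\right) \\
&= (n-1)E(X_1^2) - \frac{1}{n}\sum_{i \ne j} E(X_i)E(X_j) \\
&= (n-1)E(X_1^2) - (n-1)(E(X_1))^2.
\end{aligned}$$

Here, the independence of $X_i$ and $X_j$ for $i \ne j$ was used, and the last step holds since there are $n(n-1)$ pairs $(i,j)$ with $i \ne j$. Dividing by $n-1$ gives $E(S_n^2) = \mathrm{Var}(X_1)$. Hence, $S_n^2$ is also an unbiased estimator.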
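The three estimators and the unbiasedness of $S_n^2$ can be checked on a small case. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the check enumerates all $6^3$ equally likely samples of three rolls of a fair die, so that expectations are computed exactly with fractions rather than simulated.

```python
from fractions import Fraction
from itertools import product


def sample_mean(xs):
    """Arithmetic mean, Equation (4.1)."""
    return sum(xs) / len(xs)


def variance_known_mean(xs, mu):
    """Estimator for the variance if mu = E(X1) is known, Equation (4.2)."""
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Estimator S_n^2 for the variance if E(X1) is unknown."""
    mean = sample_mean(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


faces = [Fraction(face) for face in range(1, 7)]
n = 3
samples = list(product(faces, repeat=n))   # all samples, each with probability 6**-3

expected_s2 = sum(sample_variance(s) for s in samples) / len(samples)
expected_known_mean = sum(variance_known_mean(s, Fraction(7, 2)) for s in samples) / len(samples)
expected_divide_by_n = expected_s2 * (n - 1) / n   # expectation if one divides by n instead of n - 1

print(expected_s2, expected_known_mean, expected_divide_by_n)
```

The first two expectations agree with $\mathrm{Var}(X_1) = 35/12$, whereas dividing by $n$ instead of $n-1$ gives a value that is too small by the factor $(n-1)/n$.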

Example 43. So assume one knows that a waiting time is Exp($\lambda$)-distributed. In order to model the waiting time by a random variable, one designs an experiment with repeated measurements of the waiting time, such that the observations are independent. So assume that one obtains measurements $x_1, \ldots, x_n$. By calculating the arithmetic mean of $x_1, \ldots, x_n$, one then arrives at an approximate value $\hat{\lambda}$ for $\lambda$, that is

$$\hat{\lambda} := \frac{n}{x_1 + \cdots + x_n},$$

due to the fact that $\lim \overline{X}_n = E(X_1) = 1/\lambda$. Note that the speed of approximation can be deduced from Proposition 4.2.3.
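As an illustration of Example 43, the following sketch simulates waiting times and applies the estimator $n/(x_1 + \cdots + x_n)$. The rate 2.0, the sample size and the seed are hypothetical values chosen only for the demonstration.

```python
import random


def estimate_rate(waiting_times):
    """Estimate lambda by n / (x1 + ... + xn), the reciprocal of the arithmetic mean."""
    return len(waiting_times) / sum(waiting_times)


rng = random.Random(42)
true_rate = 2.0   # hypothetical parameter, unknown in a real experiment
waiting_times = [rng.expovariate(true_rate) for _ in range(50_000)]
rate_hat = estimate_rate(waiting_times)
print(rate_hat)
```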

4.1.2 Estimators for quantiles

Informally, a quantile refers to the inverse of the distribution function. We will now give three definitions in increasing generality.

So assume that $F$ is a continuous and strictly increasing distribution function (e.g. the distribution function of the normal distribution). Then $F^{-1} : [0,1] \to \mathbb{R} \cup \{\pm\infty\}$ exists, and for $\alpha \in [0,1]$,

$$q_\alpha := F^{-1}(\alpha)$$

is called the $\alpha$-quantile of the distribution given by $F$. If $F$ is continuous but not strictly increasing (e.g. the distribution function of the exponential distribution), then

$$q_\alpha := \max\{x \in \mathbb{R} : F(x) = \alpha\}$$

is called the $\alpha$-quantile in this case. For illustration, the 0-quantile of the exponential distribution is equal to 0. Finally, if there are no further assumptions on the distribution function $F$, we have the following definition:

$$q_\alpha := \sup\{x \in \mathbb{R} : F(x) \le \alpha\}.$$

¹ It seems not reasonable to use estimators which are not consistent. However, in rare cases, a non-consistent estimator will perform better than a consistent one.

From an i.i.d. sample $x_1, \ldots, x_n$ from a distribution with unknown distribution function $F$, it is now possible to obtain estimators for the quantiles. Historically, one is interested in estimators for $q_{0.25}$, $q_{0.5}$ and $q_{0.75}$. In order to define these estimators, the data has to be ordered according to magnitude. That is, let $x_{(i)}$ refer to the $i$-th smallest value of $x_1, \ldots, x_n$. Hence one arrives at the order statistic

$$x_{(1)}, x_{(2)}, \ldots, x_{(n)} \quad\text{with}\quad x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}.$$

The median of the sample is now defined as

$$\begin{cases} x_{((n+1)/2)} & n \text{ odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & n \text{ even}. \end{cases}$$

The lower quartile is defined by $x_{([n/4])}$, and the upper quartile is defined by $x_{([3n/4])}$, where $[\cdot]$ refers to the biggest integer which is smaller than or equal to its argument. Note that these definitions are not universal, e.g. the lower quartile is in some textbooks defined as $x_{([n/4]+1)}$, or as $(x_{([n/4])} + x_{([n/4]+1)})/2$. The lower quartile, the median and the upper quartile are sometimes also called first, second and third quartile.

Note that, by the Glivenko-Cantelli theorem, the quartiles converge to the quantiles as the number of observations tends to $\infty$.

Example 44. Let

$$x_1 = 9,\; x_2 = 8,\; x_3 = 3,\; x_4 = 8,\; x_5 = 4,\; x_6 = 4,\; x_7 = 3,\; x_8 = 3,\; x_9 = 2,\; x_{10} = 6,\; x_{11} = 1,\; x_{12} = 4.$$

The order statistic is then given by

$$x_{(1)} = 1,\; x_{(2)} = 2,\; x_{(3)} = 3,\; x_{(4)} = 3,\; x_{(5)} = 3,\; x_{(6)} = 4,\; x_{(7)} = 4,\; x_{(8)} = 4,\; x_{(9)} = 6,\; x_{(10)} = 8,\; x_{(11)} = 8,\; x_{(12)} = 9.$$

Hence the median is given by $(x_{(6)} + x_{(7)})/2 = 4$, and the quartiles are given by $x_{(3)} = 3$ and $x_{(9)} = 6$. Note that the above sample was generated from the uniform sample space with $\Omega = \{1, 2, 3, \ldots, 9\}$, and hence the real parameters are

$$q_{0.25} = 2, \quad q_{0.5} = 4, \quad q_{0.75} = 6.$$
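The definitions of the median and the quartiles translate directly into code. The sketch below (function names are illustrative) follows the conventions $x_{([n/4])}$ and $x_{([3n/4])}$ used here and is applied to the data of Example 44; other textbooks' conventions would give slightly different functions.

```python
def sample_median(xs):
    """Median of the sample, defined via the order statistic."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]              # x_((n+1)/2), shifted to 0-based index
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def lower_quartile(xs):
    """x_([n/4]); requires n >= 4 so that the index is valid."""
    ordered = sorted(xs)
    return ordered[len(ordered) // 4 - 1]


def upper_quartile(xs):
    """x_([3n/4]); requires n >= 2 so that the index is valid."""
    ordered = sorted(xs)
    return ordered[(3 * len(ordered)) // 4 - 1]


data = [9, 8, 3, 8, 4, 4, 3, 3, 2, 6, 1, 4]
print(sorted(data))
print(lower_quartile(data), sample_median(data), upper_quartile(data))
```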


4.2 Statistics and their distributions

In case of normally distributed samples, the distribution of several statistics is known.

Proposition 4.2.1. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$-distributed random variables. Then

$$\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\sigma^2} \sim \chi^2_n.$$

Here, $\chi^2_n$ refers to the $\chi^2$-distribution ('chi-squared') with $n$ degrees of freedom.² As in the case of the normal distribution, the density functions for $n = 1, 2, \ldots$ are known, but the related distribution functions can only be accessed numerically. For the graphs of the densities, see Figure 4.1. Note that the above proposition may only be used if $\mu$ is known. For the general case, we have the following.

Figure 4.1: Graphs of $\chi^2_n$-densities, for n = 2, n = 5 (blue), n = 5 (red).

Theorem 4.2.2 (Cochran's theorem). Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$-distributed random variables. Then

$$\frac{(n-1) S_n^2}{\sigma^2} = \frac{\sum_{i=1}^{n} (X_i - \overline{X}_n)^2}{\sigma^2} \sim \chi^2_{n-1}.$$

² The definition of the $\chi^2_n$-distribution is the following. Let $U_1, \ldots, U_n$ be i.i.d. random variables with $U_i \sim N(0,1)$. Then the $\chi^2_n$-distribution is defined as the distribution of

$$U_1^2 + \cdots + U_n^2.$$


The proof of Cochran's theorem makes use of the $\Gamma$-distribution and moment generating functions, and hence will be omitted. The following results might be seen as immediate consequences of this theorem. From the proof, it is even possible to derive a distributional result for the exponential distribution.

Proposition 4.2.3. Let $X_1, \ldots, X_n$ be i.i.d. Exp($\lambda$)-distributed random variables. Then

$$2\lambda \sum_{i=1}^{n} X_i \sim \chi^2_{2n}.$$

The following result will be used frequently in the sequel. Note that the statistic depends on the unknown parameters only through $\mu$, not through $\sigma$.

Proposition 4.2.4. Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$-distributed random variables. Then

$$\frac{\overline{X} - \mu}{S_n/\sqrt{n}} \sim t_{n-1}.$$

Here, $t_n$ refers to Student's $t$-distribution with $n$ degrees of freedom.³ As in the case of the normal distribution, the density functions for $n = 1, 2, \ldots$ are known, but the related distribution functions can only be accessed numerically. For the graphs of the densities, see Figure 4.2.

Proposition 4.2.5. Let $X_1, \ldots, X_n, Y_1, \ldots, Y_m$ be independent, where $X_1, \ldots, X_n$ are i.i.d. $N(\mu_X, \sigma^2)$-distributed and $Y_1, \ldots, Y_m$ are i.i.d. $N(\mu_Y, \sigma^2)$-distributed. Then

$$\frac{S^2_{X,n}}{S^2_{Y,m}} \sim F_{n-1,m-1}.$$

Here, $F_{n,m}$ refers to the $F$-distribution with degrees of freedom $n$ and $m$.⁴ As in the case of the normal distribution, the density functions are known, but the related distribution functions can only be accessed numerically. For the graphs of the densities, see Figure 4.3.

³ The definition of the $t_n$-distribution is the following. Let $X, Y$ be independent random variables with $X \sim N(0,1)$ and $Y \sim \chi^2_n$. Then the $t_n$-distribution is defined as the distribution of

$$\frac{X}{\sqrt{Y/n}}.$$

⁴ The definition of the $F_{n,m}$-distribution is the following. Let $C_1, C_2$ be independent random variables with $C_1 \sim \chi^2_n$ and $C_2 \sim \chi^2_m$. Then the $F_{n,m}$-distribution is defined as the distribution of

$$\frac{C_1/n}{C_2/m}.$$

Figure 4.2: Graphs of $t_n$-densities, for n = 5, n = 10 (blue), n = 20 (red), and the density of the standard normal distribution (dotted).

Figure 4.3: Graphs of $F_{n,m}$-densities, for n,m = 5 (blue), n,m = 10 (red), n,m = 20.

4.3 Confidence intervals

As a first application of the results on the distributions of several statistics, we may answer the following question for several cases: After having observed $n$ independent samples from a distribution, is it possible to determine an interval such that a parameter of the distribution is in that interval with high probability? An interval of that type is called a confidence interval.

For the next four constructions of a confidence interval, assume that $X_1, \ldots, X_n$ are $n$ independent samples from a $N(\mu, \sigma^2)$-distribution.

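Confidence interval 1 (Confidence interval for µ, where σ is known). Using the fact that the sum of independent normal variables is normally distributed (see Proposition 3.4.2), it follows that

$$\overline{X} \sim N(\mu, \sigma^2/n).$$

Hence $\sqrt{n}(\overline{X} - \mu)/\sigma \sim N(0,1)$. In particular, for given $\alpha \in (0,1)$ (usual values for α are 0.1, 0.05 or 0.095), it follows that

$$\begin{aligned}
1 - \alpha &= P\left(-\Phi^{-1}(1-\alpha/2) \le \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} \le \Phi^{-1}(1-\alpha/2)\right) \\
&= P\left(\overline{X} - \frac{\sigma}{\sqrt{n}}\Phi^{-1}(1-\alpha/2) \le \mu \le \overline{X} + \frac{\sigma}{\sqrt{n}}\Phi^{-1}(1-\alpha/2)\right).
\end{aligned}$$

Hence, we obtain that the parameter $\mu$ of the distribution is with probability $1-\alpha$ an element of the interval

$$\left[\overline{X} - \frac{\sigma}{\sqrt{n}}\Phi^{-1}(1-\alpha/2),\; \overline{X} + \frac{\sigma}{\sqrt{n}}\Phi^{-1}(1-\alpha/2)\right].$$

This interval is called a $(1-\alpha)$-confidence interval for $\mu$.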
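The interval for known σ can be evaluated with the standard library alone, since `statistics.NormalDist` provides $\Phi^{-1}$. The sketch below is only an illustration: the function name `z_interval` is hypothetical, the measurements are the ten values listed in Example 45, and σ = 9 is treated as known (the notes report that these values were generated from a N(80, 9²)-distribution).

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(xs, sigma, alpha):
    """(1 - alpha)-confidence interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # Phi^{-1}(1 - alpha/2)
    half_width = sigma / sqrt(len(xs)) * z
    mean = fmean(xs)
    return mean - half_width, mean + half_width


data = [79.1, 79.3, 78.2, 75.2, 71.6, 71.8, 91.3, 76.2, 69.3, 84.1]
low, high = z_interval(data, sigma=9.0, alpha=0.05)
print(round(low, 4), round(high, 4))
```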

By similar considerations and applying Proposition 4.2.4, we obtain a confidence interval for µ if σ is unknown.

Confidence interval 2 (Confidence interval for µ, where σ is unknown). Using Proposition 4.2.4, we have

$$P\left(\overline{X} - \frac{S_n}{\sqrt{n}} F^{-1}_{t_{n-1}}(1-\alpha/2) \le \mu \le \overline{X} + \frac{S_n}{\sqrt{n}} F^{-1}_{t_{n-1}}(1-\alpha/2)\right) = 1 - \alpha,$$

where $F_{t_{n-1}}$ refers to the distribution function of Student's $t$-distribution with $n-1$ degrees of freedom.

Example 45 (Confidence interval for 10 normally distributed samples). Assume that we have the following 10 independent samples from a normal distribution.

$$X_1 = 79.1,\; X_2 = 79.3,\; X_3 = 78.2,\; X_4 = 75.2,\; X_5 = 71.6,\; X_6 = 71.8,\; X_7 = 91.3,\; X_8 = 76.2,\; X_9 = 69.3,\; X_{10} = 84.1$$

By straightforward calculations, we obtain

$$\overline{X} = 77.61, \quad \sum_{i=1}^{10} (X_i - \overline{X})^2 = 381.689, \quad S_n^2 = 42.40989.$$

For $\alpha = 0.05$, a table gives $F^{-1}_{t_9}(1-\alpha/2) = F^{-1}_{t_9}(0.975) = 2.262157$. Hence the confidence interval is equal to

$$\left[77.61 - \frac{\sqrt{42.40989}}{\sqrt{10}}\cdot 2.262157,\; 77.61 + \frac{\sqrt{42.40989}}{\sqrt{10}}\cdot 2.262157\right] = [77.61 - 4.65861,\; 77.61 + 4.65861] = [72.95139,\; 82.26861].$$

For $\alpha = 0.1$, we obtain, using $F^{-1}_{t_9}(0.95) = 1.833113$, that the 90%-interval is

$$[73.83495,\; 81.38505].$$

Finally, it is worth noting that the above numbers were generated using a random number generator, and are samples of a $N(80, 9^2)$-distribution. In this example, the true parameter is not contained in the 70%-confidence interval.
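The numbers of Example 45 can be reproduced as follows. This is a sketch, not a general-purpose routine: the standard library offers no quantile function for the $t$-distribution, so the quantiles 2.262157 and 1.833113 are taken from the example and passed in as arguments; the function name `t_interval` is hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def t_interval(xs, t_quantile):
    """Confidence interval for mu with unknown sigma, given the t_{n-1}-quantile."""
    n = len(xs)
    half_width = sqrt(variance(xs) / n) * t_quantile   # variance() divides by n - 1
    mean = fmean(xs)
    return mean - half_width, mean + half_width


data = [79.1, 79.3, 78.2, 75.2, 71.6, 71.8, 91.3, 76.2, 69.3, 84.1]
interval_95 = t_interval(data, 2.262157)   # quantile for alpha = 0.05, from the example
interval_90 = t_interval(data, 1.833113)   # quantile for alpha = 0.1, from the example
print(interval_95)
print(interval_90)
```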

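Furthermore, using Proposition 4.2.1 and Theorem 4.2.2, we obtain confidence intervals for σ.

Confidence interval 3 (Confidence interval for σ, µ known). By Proposition 4.2.1, we have that

$$\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2 \sim \chi^2_n.$$

With $F_{\chi^2_n}$ referring to the distribution function of the $\chi^2_n$-distribution, we have

$$\begin{aligned}
1 - \alpha &= P\left(F^{-1}_{\chi^2_n}(\alpha/2) \le \frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2 \le F^{-1}_{\chi^2_n}(1-\alpha/2)\right) \\
&= P\left(\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{F^{-1}_{\chi^2_n}(\alpha/2)} \ge \sigma^2 \ge \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{F^{-1}_{\chi^2_n}(1-\alpha/2)}\right).
\end{aligned}$$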

Note that this interval is in fact a $(1-\alpha)$-confidence interval for $\sigma^2$. However, by replacing $\alpha/2$ and $1-\alpha/2$ with $\alpha_1$ and $1-\alpha_2$, where $\alpha_1 + \alpha_2 = \alpha$, one might end up with a shorter confidence interval. This is due to the fact that the $\chi^2$-distribution is not symmetric. The corresponding values can be found by numerically minimising the expression

$$\frac{1}{F^{-1}_{\chi^2_n}(\alpha_1)} - \frac{1}{F^{-1}_{\chi^2_n}(1-\alpha_2)}.$$

However, for $n \ge 20$, $\alpha_1 = \alpha_2 = \alpha/2$ leads to reasonable results.

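Confidence interval 4 (Confidence interval for σ, µ unknown). Cochran's theorem states that

$$\frac{n-1}{\sigma^2} S_n^2 \sim \chi^2_{n-1}.$$

For $\alpha \in (0,1)$, we hence have

$$\begin{aligned}
1 - \alpha &= P\left(F^{-1}_{\chi^2_{n-1}}(\alpha/2) \le \frac{n-1}{\sigma^2} S_n^2 \le F^{-1}_{\chi^2_{n-1}}(1-\alpha/2)\right) \\
&= P\left(\frac{(n-1) S_n^2}{F^{-1}_{\chi^2_{n-1}}(\alpha/2)} \ge \sigma^2 \ge \frac{(n-1) S_n^2}{F^{-1}_{\chi^2_{n-1}}(1-\alpha/2)}\right).
\end{aligned}$$

Note that, for $n < 21$, replacing $\alpha/2$ and $1-\alpha/2$ with $\alpha_1$ and $1-\alpha_2$, where $\alpha_1 + \alpha_2 = \alpha$, might lead to shorter confidence intervals (see above).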

Example 46. Using the same data as in Example 45, we have already seen that

$$(n-1) S_n^2 = 381.689.$$

For $\alpha = 0.05$, we then obtain, using a table, that

$$F^{-1}_{\chi^2_9}(\alpha/2) = 2.700389, \quad F^{-1}_{\chi^2_9}(1-\alpha/2) = 19.02277.$$

Hence a 95%-confidence interval for $\sigma^2$ is given by

$$\left[\frac{381.689}{19.02277},\; \frac{381.689}{2.700389}\right] = [20.06485,\; 141.3459].$$

We now include several examples for the choice of $\alpha_1, \alpha_2$ with $\alpha_1 + \alpha_2 = \alpha$. For $\alpha_1 = 0.05$, $\alpha_2 = 0$, we obtain

$$F^{-1}_{\chi^2_9}(0.05) = 3.325113, \quad F^{-1}_{\chi^2_9}(1) = \infty.$$

Hence a 95%-confidence interval is given by

$$\left[0,\; \frac{381.689}{3.325113}\right] = [0,\; 114.7898],$$

which is a shorter interval than the original one. In fact, the confidence interval for $\alpha_1 = 0.048$, $\alpha_2 = 0.002$ is the shortest interval:

$$F^{-1}_{\chi^2_9}(0.048) = 3.283226, \quad F^{-1}_{\chi^2_9}(0.998) = 26.05643.$$

Hence the corresponding interval is given by

$$[14.64855,\; 116.2543].$$
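The three intervals of Example 46 follow from one division each. In the sketch below, the function name `variance_interval` is hypothetical and the $\chi^2_9$-quantiles are the table values quoted in the example, since the standard library has no $\chi^2$ quantile function.

```python
from math import inf


def variance_interval(sum_of_squares, q_low, q_high):
    """Confidence interval [SS / q_high, SS / q_low] for sigma^2."""
    return sum_of_squares / q_high, sum_of_squares / q_low


ss = 381.689   # (n - 1) * S_n^2 from Example 45

symmetric = variance_interval(ss, 2.700389, 19.02277)   # alpha1 = alpha2 = 0.025
one_sided = variance_interval(ss, 3.325113, inf)        # alpha1 = 0.05, alpha2 = 0
shortest = variance_interval(ss, 3.283226, 26.05643)    # alpha1 = 0.048, alpha2 = 0.002

for low, high in (symmetric, one_sided, shortest):
    print(round(low, 4), round(high, 4), round(high - low, 4))
```

Printing the lengths alongside the bounds makes the comparison between the three choices of $\alpha_1, \alpha_2$ explicit.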

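Finally, using the central limit theorem, one is able to find an approximate confidence interval for $X_1, \ldots, X_n$ i.i.d. with $E(X_1^2) < \infty$.

Confidence interval 5 (Asymptotic confidence interval for E(X), where E(X²) < ∞). Set $\mu := E(X)$ and $\sigma^2 := \mathrm{Var}(X)$. Using the central limit theorem, we have

$$\frac{\sum_{i=1}^{n} (X_i - \mu)}{\sqrt{n}\,\sigma} = \frac{\overline{X} - \mu}{\sigma}\sqrt{n} \xrightarrow{\;L\;} U \sim N(0,1).$$

If $\sigma^2$ is not known, then it is possible to replace $\sigma$ by the estimator $S_n$ without changing the convergence.⁵ Hence,

$$\frac{\sum_{i=1}^{n} (X_i - \mu)}{\sqrt{n}\,S_n} \xrightarrow{\;L\;} U \sim N(0,1),$$

and we obtain the following asymptotic confidence interval:

$$\lim_{n\to\infty} P\left(\overline{X} - \frac{S_n}{\sqrt{n}}\Phi^{-1}(1-\alpha/2) \le \mu \le \overline{X} + \frac{S_n}{\sqrt{n}}\Phi^{-1}(1-\alpha/2)\right) = 1 - \alpha.$$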
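The asymptotic interval does not require normally distributed data. As a sketch, the following code applies it to simulated rolls of a fair die; the function name, the sample size and the seed are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev


def asymptotic_interval(xs, alpha):
    """Asymptotic (1 - alpha)-confidence interval for E(X), based on the CLT."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # Phi^{-1}(1 - alpha/2)
    half_width = stdev(xs) / sqrt(len(xs)) * z   # stdev() is S_n (divides by n - 1)
    mean = fmean(xs)
    return mean - half_width, mean + half_width


rng = random.Random(7)
rolls = [rng.randint(1, 6) for _ in range(2_000)]
low, high = asymptotic_interval(rolls, alpha=0.05)
print(round(low, 3), round(high, 3))
```

For a fair die the expected value is 3.5, so for most seeds the printed interval should contain 3.5; by construction this fails in roughly 5% of repetitions.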

4.4 Hypothesis tests

A hypothesis test could be informally described as a test whether, given the observed data, it is possible to obtain a contradiction to a hypothesis with high probability.

The general framework of a hypothesis test is the following.

(i) There are observations $x_1, \ldots, x_n$, which are modelled by random variables $X_1, \ldots, X_n$.

(ii) A hypothesis $H_0$ is formulated in terms of the distribution of $X_1, \ldots, X_n$ (e.g. $H_0$: $E(X_i) = 0$). Furthermore, the alternative is formulated as the complement of $H_0$, and is denoted by $H_1$.

(iii) Finally, one considers a decision function $\phi : \mathbb{R}^n \to \{0,1\}$ with the following meaning.

$$\phi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{the hypothesis is rejected} \\ 0 & \text{the hypothesis is not rejected.} \end{cases}$$

Note that $\phi(X_1, \ldots, X_n)$ is modelled as a random variable having its specific distribution, whereas $\phi(x_1, \ldots, x_n)$ is just a point.

(iv) In this framework, there are the following possible outcomes.

          φ = 1    φ = 0
    H0      ∗        ok
    H1      ok       ∗∗

Here, ∗ refers to a type I error, and ∗∗ to a type II error. Furthermore, the probability of a type I error is referred to as the level of significance of the test.

⁵ This is a consequence of Slutsky's lemma.

Remarks.

(i) This decision problem is not symmetric, that is, $H_0$ is formulated with the aim to falsify $H_0$. In particular, falsifying $H_0$ implies that $H_1$ can be accepted (with the given significance level). But '$H_0$ is not rejected' does not imply that $H_0$ can be accepted.

(ii) The level of significance is traditionally a value like 0.025, 0.05, 0.1 or 0.2. Note that the choice of the level should be made in advance. However, finding a meaningful choice depends on several parameters: the bigger the number of observations, the smaller the level might be chosen. Furthermore, decreasing the significance level always implies that the probability of a type II error increases.

Example 47 (Testing whether a coin is loaded). The idea of finding a decision with a prescribed 'error' probability is to reject an assumption or hypothesis in such a way that the probability of a wrong rejection does not exceed a prescribed value.

Here, we choose the hypothesis 'the coin is fair', and we try to reject this with an error probability of at most 15%.

Under the assumption that the coin is fair, we know that the number of 'heads' which occur when tossing the coin $n$ times is $B(n, 1/2)$-distributed. This leads to the following experiment: if we toss the coin 24 times, then one may argue, using the law of large numbers, that the number of 'heads' is approximately equal to $1/2 \cdot 24 = 12$. However, it is not clear how much the value obtained in the experiment (i.e. the number of heads) might differ from 12.

However, the distribution of the number of 'heads' is known. In particular, it is possible to determine the probability

$$P(\text{number of 'heads' is in } \{12-i, 12-i+1, \ldots, 12+i\}).$$

By calculation, one obtains that

$$P(\text{number of 'heads' is in } \{12-4, 12-4+1, \ldots, 12+4\}) = P(\text{number of 'heads' is in } \{8, \ldots, 16\}) \approx 0.9361,$$

whereas for $i = 3$ the corresponding probability is only about 0.8484, which is smaller than 0.85.

Now assume that, after tossing the coin 24 times, the number of heads is not in $\{8, \ldots, 16\}$. Then one may draw the following conclusion: the hypothesis that the coin is fair can be rejected, and the probability of wrongly rejecting a fair coin by this rule is about 0.064, which is below the prescribed 15%. However, if the number of heads is in $\{8, \ldots, 16\}$, the hypothesis cannot be rejected.
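The binomial probabilities used in Example 47 can be computed exactly. In the sketch below, the function name is hypothetical; it sums the probabilities of a $B(n, 1/2)$-distribution over the values low, ..., high, with both endpoints included.

```python
from math import comb


def fair_coin_interval_probability(n, low, high):
    """P(low <= number of heads <= high) for n tosses of a fair coin."""
    return sum(comb(n, heads) for heads in range(low, high + 1)) / 2 ** n


for i in range(1, 7):
    inside = fair_coin_interval_probability(24, 12 - i, 12 + i)
    # 1 - inside is the probability of rejecting although the coin is fair.
    print(i, 12 - i, 12 + i, round(inside, 7), round(1 - inside, 7))
```

The table printed by the loop shows, for each $i$, the error probability of the rule 'reject if the number of heads is outside $\{12-i, \ldots, 12+i\}$', so one can pick the smallest $i$ whose error probability does not exceed the prescribed level.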

For practical use, statistical software packages provide the so-called p-value, which is defined for the given data as the smallest significance level $\alpha$ at which the hypothesis can be rejected. However, one should take some care with the interpretation of this value. From the viewpoint of decision theory, the procedure of a statistical test is the following.

(i) The statistical test to be applied (or at least a family of tests) as well as the level of significance should be fixed in advance (in particular, there are formulae relating the level of significance, the probability of a type II error, and the number of observations).

(ii) Collect the data.

(iii) Perform a statistical analysis. Reject the hypothesis or not at the given significance level. Here, the p-value comes into play: if the p-value is smaller than or equal to $\alpha$, the hypothesis can be rejected at level $\alpha$.

Clearly, this is only theory. In most cases, data is collected in advance, and then the statistical methods are adapted to the data (especially using p-values). This, however, changes the actual level of significance. In extensive (and expensive) clinical studies, for example, the theoretical procedure above is used in order to control the significance.

In the example in the introduction, this applies as follows. The aim is to answer whether the starting salaries of students from different universities differ significantly. Therefore, one decides to apply a t-test with 10 observations in each group. Since this number is not that big, one decides that the level of significance is 20%. After collecting and analysing the data, it turns out that the p-value is 0.6707. Since $0.6707 > 0.2$, it follows that the hypothesis can't be rejected at the given level. However, this does not imply that there is no significant difference; one can only conclude that a difference might be revealed using more observations.

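4.4.1 Hypothesis tests for univariate data

Here, we present several tests associated with an i.i.d. sample $X_1, \ldots, X_n$.

Hypothesis test 1 (two-sided t-test). Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$-distributed random variables with $\sigma$ unknown. Furthermore, consider the following decision problem.

$$H_0 : \mu = \mu_0 \quad\text{vs.}\quad H_1 : \mu \ne \mu_0,$$

and assume that $x_1, \ldots, x_n \in \mathbb{R}$ is an i.i.d. sample from the distribution above, and let

$$\phi(x_1, \ldots, x_n) := \begin{cases} 1 & \left|\dfrac{\sqrt{n}(\overline{x} - \mu_0)}{s_n}\right| \ge t_{n-1,1-\frac{\alpha}{2}} \\[8pt] 0 & \left|\dfrac{\sqrt{n}(\overline{x} - \mu_0)}{s_n}\right| < t_{n-1,1-\frac{\alpha}{2}}, \end{cases}$$

where $t_{n-1,1-\frac{\alpha}{2}}$ refers to the $(1-\frac{\alpha}{2})$-quantile of the $t_{n-1}$-distribution. Then $\phi$ is a hypothesis test of significance level $\alpha$.

Proof. So assume that $X_1, \ldots, X_n$ are i.i.d. $N(\mu_0, \sigma^2)$-distributed (this is $H_0$). Using Proposition 4.2.4, we are able to determine the probability of a type I error.

$$\begin{aligned}
P\left(\left|\frac{\sqrt{n}(\overline{X} - \mu_0)}{S_n}\right| \ge t_{n-1,1-\frac{\alpha}{2}}\right)
&= P\left(\frac{\sqrt{n}(\overline{X} - \mu_0)}{S_n} \le -t_{n-1,1-\frac{\alpha}{2}}\right) + P\left(\frac{\sqrt{n}(\overline{X} - \mu_0)}{S_n} \ge t_{n-1,1-\frac{\alpha}{2}}\right) \\
&= P\left(\frac{\sqrt{n}(\overline{X} - \mu_0)}{S_n} \le t_{n-1,\frac{\alpha}{2}}\right) + P\left(\frac{\sqrt{n}(\overline{X} - \mu_0)}{S_n} \ge t_{n-1,1-\frac{\alpha}{2}}\right) \\
&= \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha.
\end{aligned}$$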
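The decision function of the two-sided t-test can be sketched as follows. The function names and the two values of $\mu_0$ are illustrative; the data are the ten measurements of Example 45, and the quantile $t_{9,0.975} = 2.262157$ is the table value quoted there, because the standard library provides no $t$-quantiles.

```python
from math import sqrt
from statistics import fmean, stdev


def t_statistic(xs, mu0):
    """sqrt(n) * (mean - mu0) / s_n, where s_n is the sample standard deviation."""
    return sqrt(len(xs)) * (fmean(xs) - mu0) / stdev(xs)


def two_sided_t_test(xs, mu0, t_quantile):
    """Return 1 if H0: mu = mu0 is rejected, and 0 otherwise."""
    return 1 if abs(t_statistic(xs, mu0)) >= t_quantile else 0


data = [79.1, 79.3, 78.2, 75.2, 71.6, 71.8, 91.3, 76.2, 69.3, 84.1]
t_quantile = 2.262157   # (1 - alpha/2)-quantile of t_9 for alpha = 0.05

for mu0 in (80.0, 70.0):   # hypothetical values for mu0
    print(mu0, round(t_statistic(data, mu0), 4), two_sided_t_test(data, mu0, t_quantile))
```

The test rejects exactly for those $\mu_0$ which lie outside the corresponding $(1-\alpha)$-confidence interval for $\mu$.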


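Hypothesis test 2 (one-sided t-test). In the situation of the t-test above, consider

$$H_0 : \mu \le \mu_0 \quad\text{vs.}\quad H_1 : \mu > \mu_0.$$

Then

$$\phi(x_1, \ldots, x_n) := \begin{cases} 1 & \dfrac{\sqrt{n}(\overline{x} - \mu_0)}{s_n} \ge t_{n-1,1-\alpha} \\[8pt] 0 & \dfrac{\sqrt{n}(\overline{x} - \mu_0)}{s_n} < t_{n-1,1-\alpha} \end{cases}$$

is a hypothesis test of significance level $\alpha$, where $t_{n-1,1-\alpha}$ refers to the $(1-\alpha)$-quantile of the $t_{n-1}$-distribution.

Proof. As above, but including a continuity argument.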

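Hypothesis test 3 (Test on the parameter of an exponential distribution). Let $X_1, \ldots, X_n$ be i.i.d. Exp($\lambda$)-distributed, and let

$$H_0 : \lambda \le \lambda_0 \quad\text{vs.}\quad H_1 : \lambda > \lambda_0.$$

For $x_1, \ldots, x_n \in \mathbb{R}$, let

$$\phi(x_1, \ldots, x_n) := \begin{cases} 1 & 2\lambda_0 \sum_{i=1}^{n} x_i \le q_{\chi^2_{2n},\alpha} \\[4pt] 0 & 2\lambda_0 \sum_{i=1}^{n} x_i > q_{\chi^2_{2n},\alpha}, \end{cases}$$

where $q_{\chi^2_{2n},\alpha}$ refers to the $\alpha$-quantile of the $\chi^2_{2n}$-distribution. Note that a large value of $\lambda$ corresponds to short waiting times, so that small values of the test statistic speak against $H_0$. Hence, by applying Proposition 4.2.3, $\phi$ is a hypothesis test of level $\alpha$.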

4.4.2 Hypothesis tests for bivariate data

It is often useful to compare data which comes in two groups, like 'placebo/drug', 'treatment A/treatment B', or 'before/after treatment'. Here, one distinguishes between unrelated samples, that is, the outcomes in the groups are independent, and associated samples otherwise.

Hypothesis tests for unrelated samples

Throughout this section, assume that $X_1, X_2, \ldots, X_n, Y_1, \ldots, Y_m$ are independent, and $X_i \sim N(\mu_X, \sigma_X^2)$, $Y_j \sim N(\mu_Y, \sigma_Y^2)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, m$. Furthermore, denote by $\overline{X}$, $S^2_{n,X}$, $\overline{Y}$, $S^2_{m,Y}$ the corresponding estimators in each group.

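Hypothesis test 4 (Student's t-test for unrelated samples). Assume that $\sigma^2 = \sigma_X^2 = \sigma_Y^2$, but $\sigma^2$ is unknown. So consider

$$H_0 : \mu_X = \mu_Y \quad\text{vs.}\quad H_1 : \mu_X \ne \mu_Y.$$

Then

$$\phi := \begin{cases} 1 & |T| \ge t_{n+m-2,1-\alpha/2} \\ 0 & |T| < t_{n+m-2,1-\alpha/2} \end{cases}$$

is a test of significance level $\alpha$, where $t_{n+m-2,1-\alpha/2}$ refers to the $(1-\frac{\alpha}{2})$-quantile of the $t_{n+m-2}$-distribution, and (for $N := n+m$)

$$S_N^2 := \frac{1}{n+m-2}\left(\sum_{i=1}^{n} (X_i - \overline{X})^2 + \sum_{j=1}^{m} (Y_j - \overline{Y})^2\right),$$

$$T := \sqrt{\frac{nm}{n+m}}\;\frac{\overline{X} - \overline{Y}}{S_N}.$$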
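The test statistic $T$ of Student's t-test for unrelated samples can be computed as in the following sketch. The function name and the two small samples are hypothetical and serve only to make the formula for the pooled variance $S_N^2$ concrete; the resulting value would then be compared with a tabulated quantile of the $t_{n+m-2}$-distribution.

```python
from math import sqrt
from statistics import fmean


def pooled_t_statistic(xs, ys):
    """T = sqrt(nm / (n + m)) * (mean(xs) - mean(ys)) / S_N with pooled variance S_N^2."""
    n, m = len(xs), len(ys)
    mean_x, mean_y = fmean(xs), fmean(ys)
    sum_sq = sum((x - mean_x) ** 2 for x in xs) + sum((y - mean_y) ** 2 for y in ys)
    pooled_variance = sum_sq / (n + m - 2)
    return sqrt(n * m / (n + m)) * (mean_x - mean_y) / sqrt(pooled_variance)


group_x = [1.0, 2.0, 3.0]   # hypothetical observations
group_y = [2.0, 3.0, 4.0]   # hypothetical observations
T = pooled_t_statistic(group_x, group_y)
print(T)
```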

Hypothesis test 5 (The Behrens-Fisher problem). If, in the situation of Student's t-test for unrelated samples, it is not known in advance that $\sigma_X^2 = \sigma_Y^2$, only approximate methods are available.⁶ The Welch-Satterthwaite method is the state of the art.

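Hypothesis test 6 (The F-test). The F-test is a test on equality of variances. That is,

$$H_0 : \sigma_X^2 = \sigma_Y^2 \quad\text{vs.}\quad H_1 : \sigma_X^2 \ne \sigma_Y^2.$$

Then

$$\phi := \begin{cases} 1 & \dfrac{S^2_{n,X}}{S^2_{m,Y}} \notin (q_{\alpha/2}, q_{1-\alpha/2}) \\[8pt] 0 & \text{else} \end{cases}$$

is a test of significance level $\alpha$, where $q_\kappa$ refers to the $\kappa$-quantile of the $F_{n-1,m-1}$-distribution.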

Hypothesis tests for associated samples

Associated samples occur, if there are repeated measurements at the same unit, e.g. if the datais organized as follows.

    unit    before    after treatment
    1       X_11      X_12
    ⋮       ⋮         ⋮
    n       X_n1      X_n2

In this case, the bivariate data can be transformed into univariate data by considering $X_{i1} - X_{i2}$, or $X_{i1}/X_{i2}$.

4.4.3 χ²-tests

Here, we present hypothesis tests where the associated test statistic is $\chi^2$-distributed. Note that these tests rely on the assumption that the underlying random variables have a finite set of possible outcomes. However, to apply these methods to continuous data, one may group the data into finitely many classes.

⁶ In this situation, the data is called heteroscedastic. If the variances are assumed to be equal then the data is called homoscedastic.

Pearson’s χ2-test

Assume that $X_1, X_2, \ldots, X_n$ are $n$ i.i.d. random variables, where each of them has possible outcomes $1, 2, \ldots, k$ with probabilities

$$P(X_i = l) = p_l,$$

where $p_l \in (0,1)$ for $l = 1, \ldots, k$, and $p_1 + p_2 + \cdots + p_k = 1$. In order to define the test statistic, let $F_i$ be the frequency of the event $i$ (that is, the total number of outcomes equal to $i$), and

$$T_n := \sum_{i=1}^{k} \frac{(F_i - np_i)^2}{np_i}.$$

Note that the frequencies then are distributed according to a multinomial distribution with parameters $(p_1, p_2, \ldots, p_k)$. Then the following holds:⁷

$$T_n \to \chi^2_{k-1}.$$

The corresponding test is then defined using this approximate distribution. We hence obtain the following test.

(i) $H_0 : (p_1, p_2, \ldots, p_k) = (p_1^{(0)}, p_2^{(0)}, \ldots, p_k^{(0)})$

(ii) The test statistic is given by

$$T := \sum_{i=1}^{k} \frac{(F_i - np_i^{(0)})^2}{np_i^{(0)}}.$$

(iii) The hypothesis can be rejected at significance level $\alpha$, if

$$T \ge q_{1-\alpha},$$

where $q_{1-\alpha}$ refers to the $(1-\alpha)$-quantile of the $\chi^2_{k-1}$-distribution.

Example 48. Assume that, after rolling a die 24 times, we have the following frequencies:

    i      1    2    3    4    5    6
    F_i    2    4    6    4    3    5

We are now in a position to perform a test on the hypothesis that the die is fair with significance level 5%. So let

$$H_0 : (p_1, p_2, \ldots, p_6) = \left(\tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}, \tfrac{1}{6}\right).$$

This gives (using $np_i = 24/6 = 4$)

$$T = \frac{(-2)^2}{4} + \frac{0}{4} + \frac{2^2}{4} + \frac{0}{4} + \frac{(-1)^2}{4} + \frac{1^2}{4} = \frac{5}{2}.$$

Since the 0.95-quantile of the $\chi^2_5$-distribution is $q_{0.95} = 11.071$, the hypothesis cannot be rejected.

⁷ A heuristic proof is the following: Show that $F_1, F_2, \ldots, F_{k-1}$ are asymptotically independent, and then apply the central limit theorem and the definition of the $\chi^2$-distribution. Unfortunately, a real proof is a bit more involved.
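The computation in Example 48 can be written as a short function. In this sketch the function name is hypothetical, and the quantile 11.071 is the table value quoted in the example.

```python
def pearson_statistic(frequencies, probabilities):
    """Sum over all outcomes of (F_i - n p_i)^2 / (n p_i)."""
    n = sum(frequencies)
    return sum((f - n * p) ** 2 / (n * p) for f, p in zip(frequencies, probabilities))


frequencies = [2, 4, 6, 4, 3, 5]          # observed counts of the faces 1, ..., 6
T = pearson_statistic(frequencies, [1 / 6] * 6)
reject = T >= 11.071                      # 0.95-quantile of chi^2_5 from the example
print(T, reject)
```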


The χ2-test for independence

The aim of this test is to check whether two effects are independent. Before formulatingthe corresponding mathematical objects, we will give an example where this method applies.Assume we know collect the following data from 72 persons, and want to know whether thesalary is dependent from the gender:

(i) Classified data of the salary (per year): less than 30000, between 30000 and 40000,between 40000 and 50000, between 50000 and 60000, more than 60000.

(ii) gender: male or female.

This data then can be arranged in a so called contingency table, where each cell contains thefrequency of the corresponding combination salary/gender:

≤ 30000 30001−40000 40001−50000 50001−60000 > 60001female 3 12 7 10 4male 5 10 12 8 3

This is modelled as follows. Let $X, Y$ be discrete random variables with possible sets of outcomes $\{1, \ldots, k\}$ and $\{1, \ldots, l\}$, respectively. So let $((X_i, Y_i) : i = 1, \ldots, n)$ be independent samples of the random element $(X, Y)$, and let $n_{pq}$ be the frequency of the event $[X = p, Y = q]$. Furthermore, let

$$n_{*j} := \sum_{i=1}^{k} n_{ij}, \qquad n_{i*} := \sum_{j=1}^{l} n_{ij}.$$

These objects can be arranged in the following contingency table.

        1         2         ···    k         ∑
1       n_{11}    n_{21}    ···    n_{k1}    n_{*1}
2       n_{12}    n_{22}    ···    n_{k2}    n_{*2}
⋮
l       n_{1l}    n_{2l}    ···    n_{kl}    n_{*l}
∑       n_{1*}    n_{2*}    ···    n_{k*}    n

The construction of the test now relies on the following distributional result. Note that $n_{ij}$ should be approximately equal to $nP(X = i, Y = j)$, which is the expected frequency in cell $ij$. So assume that $X$ and $Y$ are independent. Then

$$n^*_{ij} := \frac{n_{i*}\, n_{*j}}{n}$$

are also estimators for the expected frequency. Furthermore, it is known that

$$T := \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(n_{ij} - n^*_{ij})^2}{n^*_{ij}}$$

is asymptotically $\chi^2_{(k-1)(l-1)}$-distributed. Note that, in order to obtain a good approximation, the numbers $n^*_{ij}$ should be greater than or equal to 4. If this is not the case, then one may obtain that property by joining classes with low expected frequencies.

The test is then constructed from this result as in the case of Pearson’s χ2-test.

(i) H0 : X and Y are independent.

(ii) The hypothesis can be rejected with level $1-\alpha$, if

$$T \ge q_{1-\alpha},$$

where $q_{1-\alpha}$ refers to the $(1-\alpha)$-quantile of the $\chi^2_{(k-1)(l-1)}$-distribution.
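As an illustration, the test can be applied to the salary/gender table from the beginning of this section. The following Python sketch makes several assumptions: the function name `independence_statistic` is illustrative, the sample size is computed from the table entries (which sum to 74), and the quantile 9.488 is the standard tabulated 0.95-quantile of the $\chi^2_4$-distribution, which is not quoted in the text. Note also that a few expected frequencies in this table fall slightly below 4, so the result should be read as an illustration of the mechanics only.

```python
def independence_statistic(table):
    """Return (T, degrees of freedom, n) for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    T = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Expected frequency under independence: (row sum * column sum) / n.
            expected = row_sums[r] * col_sums[c] / n
            T += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return T, dof, n


# Rows: female, male; columns: the five salary classes.
salary_table = [
    [3, 12, 7, 10, 4],
    [5, 10, 12, 8, 3],
]

T, dof, n = independence_statistic(salary_table)

# Tabulated 0.95-quantile of the chi-square distribution with 4 degrees of freedom
# (an assumption of this sketch, not a value from the notes).
QUANTILE_95_DF4 = 9.488
reject = T >= QUANTILE_95_DF4

print(f"n = {n}, dof = {dof}, T = {T:.3f}, reject H0: {reject}")
```

With these numbers the statistic is well below the quantile, so independence of salary class and gender would not be rejected.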

4.4.4 Single factor analysis of variance

The analysis of variance applies to a problem which can be modelled by independent random variables $Y_{ij}$, for $1 \le i \le k$ and $1 \le j \le n_i$. One may have several applications for this situation.

(i) For each $i$, $Y_{i\cdot}$ corresponds to a single experiment. By considering $Y_{ij}$, one is able to perform multiple comparisons.

(ii) A further interpretation is factor analysis: each value of $i$ corresponds to the value of a certain factor, which may have influence or not.

In here, we will consider the case where the $Y_{ij}$ can be written as

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij},$$

where $\alpha_i \in \mathbb{R}$ with $\sum_{i=1}^{k} \alpha_i = 0$, and the $\varepsilon_{ij}$ are i.i.d. $N(0,\sigma^2)$-distributed random variables. A hypothesis of interest is

$$H_0 : \alpha_i = 0 \text{ for all } i = 1, 2, \ldots, k,$$

since it has the following interpretations with respect to the above applications.

(i) The expected values of the groups are equal

(ii) The factor has no influence.

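In order to proceed, we will need the following notions, where $N := n_1 + \cdots + n_k$:

$$\bar{Y}_i := \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}, \qquad \bar{Y} := \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij},$$

$$S_D^2 := \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2, \qquad S_A^2 := \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{Y}_i - \bar{Y})^2.$$

We then have that, under $H_0$,

$$\frac{S_A^2/(k-1)}{S_D^2/(N-k)} \sim F_{k-1,N-k}.$$

Proof. By Cochran's theorem,

$$\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2/\sigma^2 \sim \chi^2_{n_i-1}$$

for all $i$. By the definition of the $\chi^2$-distribution, it follows that $S_D^2/\sigma^2 \sim \chi^2_{N-k}$.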
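The quantities $S_D^2$, $S_A^2$ and the resulting F-ratio can be computed directly from grouped data. The following Python sketch uses a small made-up data set (the three groups are hypothetical and only serve as an illustration); the function name `anova_f_statistic` is likewise an assumption of this sketch.

```python
def anova_f_statistic(groups):
    """Return (S_A^2 / (k-1)) / (S_D^2 / (N-k)) for a list of groups of observations."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    group_means = [sum(g) / len(g) for g in groups]

    # S_D^2: squared deviations of the observations from their group mean.
    s2_d = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    # S_A^2: squared deviations of the group means from the grand mean,
    # counted once per observation in the group.
    s2_a = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))

    return (s2_a / (k - 1)) / (s2_d / (N - k))


# Hypothetical data: k = 3 groups with 3 observations each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
F = anova_f_statistic(groups)
print(f"F = {F:.4f}")
```

The value of `F` would then be compared with a quantile of the $F_{k-1,N-k}$-distribution, here $F_{2,6}$.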

4.5 Linear regression

Linear regression is a special case of a linear model. In here, one assumes that a response variable $y$ is a linear combination of $k$ known explanatory variables $x_1, x_2, \ldots, x_k$ plus an error term. That is,

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon,$$

where $\varepsilon$ is a (random) error term with $E(\varepsilon) = 0$. The aim in here is now to use subsequent measurements $(y_i, x^{(i)}_1, x^{(i)}_2, x^{(i)}_3, \ldots, x^{(i)}_k)$ to find approximate values for the $\beta_j$'s. These equations can be rewritten using matrix multiplication:

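$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_k \\ 1 & x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_k \\ \vdots & & & & \vdots \\ 1 & x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_k \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},$$

where $\varepsilon_1, \ldots, \varepsilon_n$ refer to i.i.d. error terms with $E(\varepsilon_i) = 0$. With

$$Y := \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X := \begin{pmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_k \\ 1 & x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_k \\ \vdots & & & & \vdots \\ 1 & x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_k \end{pmatrix}, \quad \beta := \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon := \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},$$

the above equation can be rewritten as $Y = X\beta + \varepsilon$. In here, $Y$ is called the response vector, $\beta$ is called the parameter vector, and $X$ is called the design matrix.

In order to find suitable approximations for $\beta = (\beta_0, \ldots, \beta_k)^T$, one tries to find $\hat\beta := (\hat\beta_0, \ldots, \hat\beta_k)^T$ such that

$$\|Y - X\hat\beta\|^2$$

is minimal (recall that $\|\cdot\|$ refers to the length of a vector). This method is called the method of least squares, and $\hat\beta$ can be found using vector analysis. Recall that

$$\frac{\partial \|Y - X\beta\|^2}{\partial \beta_i} = 0$$

for $i = 0, \ldots, k$ is a necessary condition for $\beta \mapsto \|Y - X\beta\|^2$ attaining its minimum at $\hat\beta$. In fact, the following holds.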

Theorem 4.5.1. Assume that $\operatorname{rank}(X) = k+1$. Then

$$\hat\beta = (X^T X)^{-1} X^T Y.$$

Remark. $\operatorname{rank}(X) = k+1$ is a very weak assumption, since each carefully designed experiment should have the property that $\operatorname{rank}(X) = k+1$. For the example below ($k = 1$) this is achieved by observing the response variable for at least 2 different values of $x$, since in this case, $(1, x_1)$ and $(1, x_2)$ are linearly independent.

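Example 49. For $k = 1$, we have $y = \beta_0 + \beta_1 x + \varepsilon$. So assume that we have observed $n$ pairs $(x_i, y_i)$. Then the design matrix is given by

$$X := \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$

Straightforward calculations now give

$$X^T X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix},$$

$$(X^T X)^{-1} = \frac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix},$$

$$X^T Y = \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix},$$

$$\hat\beta = \frac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix} \begin{pmatrix} \sum y_i \\ \sum x_i y_i \end{pmatrix} = \frac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{pmatrix} \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \\ n \sum x_i y_i - \sum x_i \sum y_i \end{pmatrix}.$$

Set $\bar{x} := \frac{1}{n} \sum x_i$, and $\bar{y} := \frac{1}{n} \sum y_i$. Using the identities

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2,$$

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right),$$

it follows that

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

Furthermore, since $X^T X (\hat\beta_0, \hat\beta_1)^T = X^T Y$, it follows that

$$n \hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \quad \Rightarrow \quad \hat\beta_0 = \bar{y} - \bar{x} \hat\beta_1.$$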
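The two routes of Example 49 (the centred formulas and the explicit inverse of $X^T X$) can be compared numerically. The following Python sketch uses four hypothetical data points; the function names are illustrative and not part of the notes.

```python
def simple_linear_regression(xs, ys):
    """Least squares estimates (beta0, beta1) via the centred formulas of Example 49."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - x_bar * beta1
    return beta0, beta1


def normal_equations_solution(xs, ys):
    """The same estimates via (X^T X)^{-1} X^T Y, written out for k = 1."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    det = n * sum_xx - sum_x ** 2  # nonzero if at least two x values differ
    beta0 = (sum_xx * sum_y - sum_x * sum_xy) / det
    beta1 = (n * sum_xy - sum_x * sum_y) / det
    return beta0, beta1


# Hypothetical observations (x_i, y_i).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

beta0, beta1 = simple_linear_regression(xs, ys)
beta0_ne, beta1_ne = normal_equations_solution(xs, ys)
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}")
```

Both functions should return the same pair of estimates up to rounding, which is a quick check of the identities used in the example.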


Chapter 5

Fourier analysis (ESM 2B)

The mathematical theory of Fourier analysis was invented by the engineer Joseph Fourier in order to solve the heat equation. However, it took some time to obtain a precise mathematical formulation. In fact, it turned out that the notion of infinite dimensional vector spaces in combination with a notion of convergence of functions plays the key role in here. In order to motivate this, we will first give a discussion of several types of convergence for functions, since they are at the center of the mathematical framework of Fourier analysis.

So let $V := \{f : A \to \mathbb{F}\}$ be the (vector) space of functions from $A$ to $\mathbb{F}$, where $A$ is either a closed interval of $\mathbb{R}$ (i.e. $A = [a,b]$, for some $a, b \in \mathbb{R}$), or $A = \mathbb{R}$. As it was mentioned above, we now have to develop suitable concepts for the convergence of a sequence of functions $(f_n : n \in \mathbb{N})$ to another function $f \in V$.

Pointwise convergence. The most intuitive notion of convergence is the following. We say that $f_n$ converges to $f$ pointwise, if for each $x \in A$,

$$\lim_{n\to\infty} f_n(x) = f(x).$$

Uniform convergence. The next type of convergence is informally the following: we say that $f_n$ converges to $f$ uniformly, if the maximal distance between the graphs of $f_n$ and $f$ tends to zero, as $n$ tends to infinity. Mathematically, this is formulated as follows. The sequence of functions $(f_n)$ converges to $f$ uniformly, if

$$\sup\{|f_n(x) - f(x)| : x \in A\}$$

tends to zero as $n$ tends to infinity.

In here, recall that the supremum of a set $B \subset \mathbb{R}$ is denoted by $\sup(B)$, and is defined as follows. The element $b \in \mathbb{R} \cup \{\infty\}$ is called the supremum of $B \subset \mathbb{R}$, $B \neq \emptyset$, if $b$ is the smallest element in $\mathbb{R} \cup \{\infty\}$ such that

$$x \le b \text{ for all } x \in B.$$

In complete analogy, the infimum of a set $B \subset \mathbb{R}$ is denoted by $\inf(B)$, and is defined as follows. The element $a \in \mathbb{R} \cup \{-\infty\}$ is called the infimum of $B \subset \mathbb{R}$, $B \neq \emptyset$, if $a$ is the biggest element in $\mathbb{R} \cup \{-\infty\}$ such that

$$x \ge a \text{ for all } x \in B.$$

For example, $\sup([a,b]) = \sup([a,b)) = b$, and $\inf([a,b]) = \inf((a,b]) = a$. Further examples are

$$\sup\{\tfrac{1}{n} : n \in \mathbb{N}\} = 1, \qquad \inf\{\tfrac{1}{n} : n \in \mathbb{N}\} = 0.$$

For convenience, one uses alternatively the following way of writing sup and inf:

$$\sup_{n \in \mathbb{N}} \tfrac{1}{n} := \sup\{\tfrac{1}{n} : n \in \mathbb{N}\}, \quad \text{or} \quad \inf_{x \in A} f(x) := \inf\{f(x) : x \in A\}.$$

$L^p$-convergence. A further notion of convergence relies on convergence in average, and is therefore also called convergence in the $p$-th mean, where $p$ is an element of $[1,\infty)$. This notion is defined as follows. The sequence of functions $(f_n)$ converges to $f$ in $L^p$, if

$$\lim_{n\to\infty} \int_A |f_n(x) - f(x)|^p \, dx = 0.$$

In here, $\int_A$ either refers to $\int_a^b$ if $A = [a,b]$, or to $\int_{-\infty}^{\infty}$ for $A = \mathbb{R}$, respectively.

In order to illustrate these convergences, we will look at the following example. Let $A = [0,1]$, and $f_n(x) = x^n$. Then

$$\lim_{n\to\infty} f_n(x) = \lim_{n\to\infty} x^n = \begin{cases} 1 & x = 1 \\ 0 & \text{else.} \end{cases}$$

If we now consider $f(x) := \lim f_n(x)$, it follows by definition that $f_n$ converges to $f$ pointwise. However, the sequence $(f_n)$ does not converge to $f$ uniformly. For each $n \in \mathbb{N}$, there exists $y \in [0,1)$ such that $f_n(y) - f(y) = 1/2$ (set $y := \sqrt[n]{1/2}$). Hence,

$$\sup\{|f_n(x) - f(x)| : x \in [0,1]\} \ge \frac{1}{2}.$$

Hence $\lim_{n\to\infty} \sup\{|f_n(x) - f(x)| : x \in [0,1]\} \ge 1/2$. So, $(f_n)$ does not converge uniformly to $f$. Finally, with respect to $L^p$-convergence, note that

$$\int_0^1 |x^n - f(x)|^p \, dx = \int_{[0,1)} (x^n)^p \, dx + \int_1^1 0 \, dx = \int_0^1 x^{np} \, dx = \frac{1}{np+1} \to 0.$$

Hence, $f_n$ converges to $f$ in $L^p$.

Some relations between these types are rather intricate, so we will name only some of them. It is rather easy to see that uniform convergence implies pointwise convergence. Moreover, if $A = [a,b]$, then uniform convergence also implies $L^p$-convergence. However, $L^p$-convergence does not have to imply pointwise convergence, and vice versa.
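The behaviour of $f_n(x) = x^n$ on $[0,1]$ described above can be checked numerically. The following Python sketch is only an illustration: the function names and the midpoint rule with 20000 steps are choices made here, not part of the notes.

```python
def sup_distance_lower_bound(n):
    """Evaluate |f_n(y) - f(y)| at y = (1/2)^(1/n), which lies in [0, 1)."""
    y = 0.5 ** (1.0 / n)
    return y ** n  # f(y) = 0 because y < 1


def lp_distance_to_the_p(n, p, steps=20000):
    """Midpoint-rule approximation of the integral of |x^n - f(x)|^p over [0, 1]."""
    h = 1.0 / steps
    return h * sum(((j + 0.5) * h) ** (n * p) for j in range(steps))


for n in (1, 10, 100):
    lower = sup_distance_lower_bound(n)
    integral = lp_distance_to_the_p(n, p=2)
    exact = 1.0 / (2 * n + 1)
    print(f"n = {n:3d}: sup-distance >= {lower:.3f}, "
          f"integral = {integral:.6f} (exact {exact:.6f})")
```

The lower bound for the sup-distance stays at $1/2$ for every $n$, while the integral shrinks like $1/(np+1)$, in line with the computation above.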


5.1 Banach and Hilbert spaces

In this section, we will present the key objects for the development of the Fourier expansion, and related results. We begin with the definition of a normed vector space.

Definition 5.1.1. Let $V$ be an $\mathbb{F}$-vector space, and $\|\cdot\| : V \to \mathbb{R}$ be a map with the following properties.

(i) $\|v\| \ge 0$ for all $v \in V$.

(ii) $\|v\| = 0$ if and only if $v = 0$.

(iii) $\|\lambda v\| = |\lambda| \, \|v\|$, for all $v \in V$, $\lambda \in \mathbb{F}$.

(iv) $\|u + v\| \le \|u\| + \|v\|$, for all $u, v \in V$.

In this situation, $\|\cdot\|$ is called a norm, and $V$ (or in order to be precise, the pair $(V, \|\cdot\|)$) is called a normed vector space.

Standard, finite dimensional examples for normed vector spaces are $(\mathbb{F}, |\cdot|)$, where $|\cdot|$ refers to the absolute value, and $(\mathbb{F}^n, \sqrt{(\cdot,\cdot)})$, where $(\cdot,\cdot)$ refers to the standard scalar product.

In particular, the notion of a norm immediately gives rise to the notion of convergence. Namely, a sequence $(v_n)$ in $V$ is said to converge to $v \in V$ in norm, if

$$\lim_{n\to\infty} \|v_n - v\| = 0.$$

With respect to the finite dimensional example $(\mathbb{F}, |\cdot|)$, this convergence is nothing else than the one we know already: Assume that $(x_n)$ is a sequence in $\mathbb{F}$. Then $\lim x_n = x$ if and only if $\lim |x_n - x| = 0$.

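Furthermore, with respect to $(\mathbb{F}^n, \sqrt{(\cdot,\cdot)})$, and $v_k = (x^{(k)}_1, \ldots, x^{(k)}_n)^T \in \mathbb{F}^n$, $v = (x_1, \ldots, x_n)^T \in \mathbb{F}^n$, we obtain

$$\lim_{k\to\infty} \sqrt{(v_k - v, v_k - v)} = 0 \iff \lim_{k\to\infty} \sum_{i=1}^{n} |x^{(k)}_i - x_i|^2 = 0 \iff \lim_{k\to\infty} |x^{(k)}_i - x_i| = 0 \text{ for all } i = 1, \ldots, n.$$

Hence, convergence in norm in $\mathbb{F}^n$ means usual convergence in each coordinate.

With respect to the infinite dimensional examples above, we already have two candidates for norms at hand. Denote by, for $f \in \{f : A \to \mathbb{F}\}$,

$$\|f\|_\infty := \sup_{x \in A} |f(x)|, \quad \text{and} \quad \|f\|_p := \left( \int_A |f|^p \, dx \right)^{\frac{1}{p}},$$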

whenever the corresponding expression makes sense. It is easy to see that uniform convergence is convergence with respect to the norm $\|\cdot\|_\infty$, and that $L^p$-convergence is equivalent to convergence with respect to $\|\cdot\|_p$, respectively. It hence remains to find the associated vector spaces. With respect to $\|\cdot\|_\infty$, the answer is the vector space of continuous functions $C([a,b])$, that is

$$C([a,b], \mathbb{F}) := \{f : [a,b] \to \mathbb{F} : f \text{ continuous}\}.$$

In here, we may drop $\mathbb{F}$ sometimes for convenience. Note that, since $[a,b]$ is a bounded and closed interval in $\mathbb{R}$, each $f \in C([a,b])$ attains the maximum of $|f(x)|$ for some $x \in [a,b]$. That is, for given $f$, there exists $x_{\max}$ with

$$|f(x_{\max})| = \sup_{x \in [a,b]} |f(x)| = \|f\|_\infty.$$

In particular, $\|f\|_\infty < \infty$ for all $f \in C([a,b])$. The proof that $\|\cdot\|_\infty$ is in fact a norm is left as an exercise.

With respect to $\|\cdot\|_p$, the associated space is

$$L^p(A, \mathbb{F}) := \{f : A \to \mathbb{F} : \|f\|_p < \infty\},$$

where $A$ is either $[a,b]$ or $\mathbb{R}$. Obviously, properties (i) and (iii) are satisfied by $\|\cdot\|_p$, and property (iv) is known as Minkowski's inequality, which will not be presented in here. However, in a strict sense, $\|\cdot\|_p$ does not satisfy property (ii), since the integral of the absolute value of a function might be equal to zero even if the function is not equal to zero everywhere (see e.g. $\lim x^n$ in the above example). In order to solve that problem, two elements $f, g \in L^p(A)$ are said to be equivalent if $\int |f - g| \, dx = 0$.¹ For ease of notation, we write $f \overset{L^p}{=} g$ in this case. Note that this is no standard notation, but it is beyond the scope of this course to explain the notion of an identity which holds 'almost surely'.

Completeness and separability

In order to define separability, one has to distinguish between two types of infinity: a set might be countably infinite (e.g. $\mathbb{N}$) or uncountably infinite (e.g. $\mathbb{R}$). For the precise definition, see Definition 3.2.1.

The reason why one started to consider $\mathbb{R}$ instead of the set of rational numbers $\mathbb{Q}$ was the following observation. By geometric constructions (e.g. the length of the diagonal of the unit square) or by taking limits, it turned out that one may obtain 'numbers' which are no longer in $\mathbb{Q}$.² To circumvent this problem, the set $\mathbb{R}$ was constructed by adding all limits of convergent sequences in $\mathbb{Q}$.³ Moreover, this construction led to an object with the following properties.

¹ In mathematical terms, this gives rise to a so-called equivalence relation. The set of equivalent functions is then called an equivalence class. By treating each equivalence class as a single element, one obtains the space $L^p(A)$, for which $\|\cdot\|_p$ is in fact a norm.

² This was already known by the Greeks: Hippasos of Metapont proved that in a pentagram, irrational numbers do appear. In particular, this led to a contradiction of one of Pythagoras' philosophical axioms. According to a rumour, he therefore arranged that Hippasos was drowned.

³ In order to be precise, one has to define what a convergent sequence is, if the limit is not known, or if the limit is not an element of the ambient space. In mathematical terms, this is formulated as follows. A Cauchy sequence is a sequence $(x_n)$ in $\mathbb{Q}$ (or $\mathbb{R}$), such that for all $\varepsilon > 0$, there exists $N(\varepsilon) \in \mathbb{N}$ such that $|x_n - x_m| \le \varepsilon$ for all $n, m > N(\varepsilon)$. Note that, by replacing $|\cdot|$ with $\|\cdot\|$ in the definition, the notion of a Cauchy sequence also makes sense with respect to normed spaces.

(i) The limit of each convergent sequence (to be precise: of each Cauchy sequence, see footnote) is an element of $\mathbb{R}$. A space with this property is called complete.

(ii) There exists a countable subset of $\mathbb{R}$ (e.g. $\mathbb{Q}$), such that the set of limits of convergent sequences in this set (to be precise again: of Cauchy sequences, see footnote) is equal to $\mathbb{R}$. A space with this property is called separable.

Due to the importance of these notions for $\mathbb{R}$, one is tempted to apply these concepts (including the related methods of proof) also in the theory of normed spaces. In fact, most of the following results rely on the fact that the underlying normed spaces are separable and complete. Even if no further details of these concepts will be discussed in here, it will turn out that these two properties guarantee e.g. the existence of orthonormal bases of $L^2(\mathbb{R})$.

We are now in position to introduce the notion of a Banach space: a normed space is called a Banach space if it is complete and separable.

Theorem 5.1.2.

(i) The normed space $C([a,b], \mathbb{F})$ is a Banach space.

(ii) The normed space $L^p(A, \mathbb{F})$ is a Banach space, for either $A = \mathbb{R}$ or $A = [a,b]$ for some $a, b \in \mathbb{R}$ (Theorem of Riesz-Fischer).

In order to define the expansion of a function in terms of a Fourier series, we need the notion of an orthonormal basis. The first step in this direction is to define an inner product as in the finite dimensional case.

Definition 5.1.3. A vector space $H$ is called Hilbert space, if there exists an inner product $(\cdot,\cdot)$ on $H$ such that $H$ is a complete space with respect to the norm $\|\cdot\| = \sqrt{(\cdot,\cdot)}$.

Recall that an inner product in this context is defined (as in the finite dimensional case) as a map $(\cdot,\cdot) : H \times H \to \mathbb{F}$ with, for all $u, v, w \in H$, and $\lambda \in \mathbb{F}$,

(i) $(v,w) = \overline{(w,v)}$,

(ii) $(u+v,w) = (u,w) + (v,w)$,

(iii) $(v,\lambda w) = \overline{\lambda} (v,w)$,

(iv) $(v,v) > 0$ for all $v \neq 0$.

Note that, by combining (i) and (iii), we have

$$(\lambda v,w) = \lambda (v,w).$$

In here, we will make extensive use of the Hilbert space $L^2([a,b], \mathbb{F})$, where

$$(f,g) := \int_a^b f \overline{g} \, dx.$$

In order to see that this in fact is a Hilbert space, we have to verify that $(\cdot,\cdot)$ is an inner product (which is easy), and that the space is complete. Since $|f|^2 = f \overline{f}$, the Riesz-Fischer theorem gives that $L^2([a,b])$ is in fact complete. Moreover, note that there are several results from finite dimensional vector spaces like the Cauchy-Schwarz inequality and the Pythagorean theorem, which remain true in this infinite dimensional setting.

5.1.1 Orthonormal bases for separable Hilbert spaces

As in the finite dimensional case, we are now searching for bases, or more precisely for orthonormal bases, in order to treat elements of this function space explicitly.

In order to see that for a separable Hilbert space, an orthonormal basis always exists, we have to introduce the following notions.

Definition 5.1.4. A subset $A$ of a Hilbert space $H$ is called complete, if for each element $v \in H$, there exists a sequence $(v_n)$ in $\operatorname{span}(A)$ with

$$\lim_{n\to\infty} \|v_n - v\| = 0.$$

Recall that $\operatorname{span}(A)$ for arbitrary sets is defined to be the set of all finite linear combinations. Moreover, a subset $A \subset H$ is called orthogonal, if $(v,w) = 0$ for all $v, w \in A$, $v \neq w$. Moreover it is called orthonormal, if it is orthogonal, and $(v,v) = 1$ for all $v \in A$. Finally, an orthonormal basis for $H$ is defined as an orthonormal and complete subset of $H$.

So assume that $H$ is separable. Then there even exists a countable $A \subset H$ such that for each element $v \in H$, there exists a sequence $(v_n)$ in $A$ (and not in $\operatorname{span}(A)$) with

$$\lim_{n\to\infty} \|v_n - v\| = 0.$$

Note that, even for finite dimensional vector spaces, the set $A$ has to be countably infinite.⁴ By choosing an enumeration of the set $A$, the set $A$ can be written as

$$A = \{b_n : n \in \mathbb{N}\}.$$

By applying the Gram-Schmidt procedure, we hence obtain an orthonormal set

$$B = \{c_n : n \in \mathbb{N}\}$$

which moreover is complete (this is a consequence of Proposition 2.3.10, which can be applied in this situation since the span is defined using only finite linear combinations, see remark on page 9). So we have shown the following.

Theorem 5.1.5. For a separable Hilbert space $H$, there exists an orthonormal basis.

One of the main advantages in this situation is that an orthonormal basis allows one to determine the coefficients with respect to the above orthonormal, complete set explicitly. For ease of notation, if the limit exists, set

$$\sum_{n \in J} v_n := \lim_{N\to\infty} \sum_{n \in J,\, n \le N} v_n,$$

where $(v_n)$ is a sequence in $H$, and $J$ is some countable subset of $\mathbb{N}$.

Theorem 5.1.6. Let $\{e_n : n \in J\}$ be an orthonormal system (not necessarily complete), where $J \subset \mathbb{N}$ is either finite or countably infinite. Then, for all $v \in H$, we have the following.

(i) Bessel's inequality:

$$\sum_{n \in J} |(v,e_n)|^2 \le \|v\|^2.$$

(ii) Parseval's identity:⁵

$$\sum_{n \in J} |(v,e_n)|^2 = \|v\|^2 \quad \text{if and only if} \quad \sum_{n \in J} (v,e_n) e_n = v.$$

⁴ cf. $\mathbb{Q}^n \subset \mathbb{R}^n$.

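Proof. For $N \in \mathbb{N}$, let

$$v_N := \sum_{n \in J,\, n \le N} (v,e_n) e_n.$$

It then follows that $(v - v_N, v_N) = 0$. This can be seen by the following argument, where the result for orthonormal bases in Equation 2.2 is applied.

$$\begin{aligned} (v - v_N, v_N) &= (v,v_N) - (v_N,v_N) = (v,v_N) - \|v_N\|^2 \\ &= \Bigl(v, \sum_{n \in J,\, n \le N} (v,e_n) e_n\Bigr) - \|v_N\|^2 \\ &= \sum_{n \in J,\, n \le N} \overline{(v,e_n)} (v,e_n) - \|v_N\|^2 \\ &= \|v_N\|^2 - \|v_N\|^2 = 0. \end{aligned}$$

Hence by the Pythagorean theorem (and Equation 2.2),

$$0 \le \|v - v_N\|^2 = \|v\|^2 - \|v_N\|^2 = \|v\|^2 - \sum_{n \in J,\, n \le N} |(v,e_n)|^2.$$

Hence, by taking the limit as $N \to \infty$, the assertion follows.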

This result now leads to the following theorem, which summarises properties of an orthonormal basis, and, in particular, gives that the coefficients are uniquely determined.

Theorem 5.1.7. Let $\{e_i : i \in J\}$ be a countable, orthonormal basis for the separable Hilbert space $H$. Then, for each $v \in H$, there exists a uniquely determined sequence $(c_i : i \in J)$ of elements in $\mathbb{F}$ such that

$$v = \sum_{i \in J} c_i e_i.$$

Furthermore, $\|v\|^2 = \sum_{i \in J} |c_i|^2$, and $c_i = (v,e_i)$ for all $i \in J$.

Proof. In order to see that the sequence $(c_i)$ is unique, assume that

$$v = \sum_{i \in J} c_i e_i = \sum_{i \in J} c'_i e_i.$$

Then, using the same orthogonality arguments as in the proof of Parseval's identity, it follows that, for all $N \in \mathbb{N}$,

$$\sum_{i \in J,\, i \le N} |c_i - c'_i|^2 = 0.$$

Hence, the sequence $(c_i)$ is unique. The remaining parts are a consequence of Parseval's identity and the fact that an orthonormal basis is complete.
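Bessel's inequality and Parseval's identity can already be observed in the finite dimensional space $\mathbb{R}^3$ with the standard scalar product. The following Python sketch uses an orthonormal basis $e_1, e_2, e_3$ and a vector $v$ that are chosen here purely for illustration.

```python
import math


def inner(u, v):
    """Standard scalar product on R^n."""
    return sum(a * b for a, b in zip(u, v))


s = 1 / math.sqrt(2)
e1 = (s, s, 0.0)
e2 = (s, -s, 0.0)
e3 = (0.0, 0.0, 1.0)
v = (3.0, 1.0, 2.0)

norm_sq = inner(v, v)

# Orthonormal system {e1, e2}, which is not complete: Bessel's inequality.
bessel_sum = sum(inner(v, e) ** 2 for e in (e1, e2))

# Orthonormal basis {e1, e2, e3}: Parseval's identity and the expansion of v.
basis = (e1, e2, e3)
parseval_sum = sum(inner(v, e) ** 2 for e in basis)
reconstructed = tuple(sum(inner(v, e) * e[i] for e in basis) for i in range(3))

print(f"||v||^2 = {norm_sq}, Bessel sum = {bessel_sum:.4f}, "
      f"Parseval sum = {parseval_sum:.4f}")
print("reconstructed v:", reconstructed)
```

For the incomplete system the sum of squared coefficients is strictly smaller than $\|v\|^2$, while for the full basis both sides agree and the coefficients $(v,e_i)$ reproduce $v$.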

⁵ Due to applications, this result is sometimes also called Parseval's energy conservation theorem.

5.2 Fourier series in L2([0,L])

We now apply the theory to $L^2([a,b])$ for $a, b \in \mathbb{R}$, $a < b$, where we have to distinguish between the space of real and complex valued functions. They are denoted by $L^2([a,b], \mathbb{R})$ and $L^2([a,b], \mathbb{C})$, respectively. Recall that with respect to these spaces (see the remark on the normed spaces $L^p$), we have the following problem. If $\|f - g\|_2 = 0$, then this does not necessarily mean that $f(x) = g(x)$ for all $x \in [a,b]$ (in fact, this is only true for 'almost all' $x$). For ease of exposition, we will use the non-standard notation $f \overset{L^2}{=} g$ for $\|f - g\|_2 = 0$ in here.

5.2.1 The Fourier basis

In order to start the investigations, and obtain explicit representations of functions, one needs an explicit orthonormal basis. These are given by the following theorem, which will not be proved in here.

Theorem 5.2.1. The set of functions from $[0,2\pi]$ to $\mathbb{R}$

$$\{1, \cos x, \sin x, \cos(2x), \sin(2x), \cos(3x), \sin(3x), \ldots\}$$

is an orthogonal and complete set for $L^2([0,2\pi], \mathbb{R})$. Furthermore, the set of functions from $[0,2\pi]$ to $\mathbb{C}$

$$\{e^{inx} : n \in \mathbb{Z}\}$$

is an orthogonal and complete set for $L^2([0,2\pi], \mathbb{C})$.

In order to obtain an orthonormal basis, we have to normalize these vectors. This is done by calculating their norms, for $n \in \mathbb{N}$, $m \in \mathbb{Z}$.

$$\|1\|^2 = \int_0^{2\pi} 1^2 \, dx = \int_0^{2\pi} 1 \, dx = 2\pi,$$

$$\|\cos(nx)\|^2 = \int_0^{2\pi} (\cos(nx))^2 \, dx = \int_0^{2\pi} \frac{1}{2} (1 + \cos(2nx)) \, dx = \left. \left( \frac{x}{2} + \frac{\sin(2nx)}{4n} \right) \right|_0^{2\pi} = \pi,$$

$$\|\sin(nx)\|^2 = \int_0^{2\pi} (\sin(nx))^2 \, dx = \int_0^{2\pi} (1 - (\cos(nx))^2) \, dx = 2\pi - \pi = \pi,$$

$$\|e^{imx}\|^2 = \int_0^{2\pi} e^{imx} \overline{e^{imx}} \, dx = \int_0^{2\pi} e^{imx} e^{-imx} \, dx = \int_0^{2\pi} 1 \, dx = 2\pi.$$

As a consequence of these calculations, we now obtain the following orthonormal bases.

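Proposition 5.2.2. The set of functions from $[0,2\pi]$ to $\mathbb{R}$

$$\left\{ \frac{1}{\sqrt{2\pi}}, \frac{\cos x}{\sqrt{\pi}}, \frac{\sin x}{\sqrt{\pi}}, \frac{\cos(2x)}{\sqrt{\pi}}, \frac{\sin(2x)}{\sqrt{\pi}}, \frac{\cos(3x)}{\sqrt{\pi}}, \frac{\sin(3x)}{\sqrt{\pi}}, \ldots \right\}$$

is an orthonormal basis for $L^2([0,2\pi], \mathbb{R})$. Furthermore, the set of functions from $[0,2\pi]$ to $\mathbb{C}$

$$\left\{ \frac{e^{inx}}{\sqrt{2\pi}} : n \in \mathbb{Z} \right\}$$

is an orthonormal basis for $L^2([0,2\pi], \mathbb{C})$.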

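With respect to applications, we may generalize these results to $L^2([0,L])$, for some $L > 0$. This is done via a change of variable $x \to (2\pi x)/L$. Using similar calculations as above, we obtain the following.

Theorem 5.2.3. The set of functions from $[0,L]$ to $\mathbb{R}$

$$\left\{ \frac{1}{\sqrt{L}}, \sqrt{\frac{2}{L}} \cos\frac{2\pi x}{L}, \sqrt{\frac{2}{L}} \sin\frac{2\pi x}{L}, \sqrt{\frac{2}{L}} \cos\frac{4\pi x}{L}, \sqrt{\frac{2}{L}} \sin\frac{4\pi x}{L}, \sqrt{\frac{2}{L}} \cos\frac{6\pi x}{L}, \ldots \right\}$$

is an orthonormal basis for $L^2([0,L], \mathbb{R})$. Furthermore, the set of functions from $[0,L]$ to $\mathbb{C}$

$$\left\{ \frac{1}{\sqrt{L}} e^{\frac{2\pi n i x}{L}} : n \in \mathbb{Z} \right\}$$

is an orthonormal basis for $L^2([0,L], \mathbb{C})$.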
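The orthonormality stated in Theorem 5.2.3 can be checked numerically for the first few real basis functions. In the following Python sketch, the value $L = 3$, the cut-off at frequency 3 and the midpoint rule with 2000 nodes are arbitrary choices made for illustration; completeness is of course not something such a check can show.

```python
import math


def fourier_basis(L, n_max):
    """First 2*n_max + 1 functions of the real orthonormal system on [0, L]."""
    basis = [lambda x: 1 / math.sqrt(L)]
    for n in range(1, n_max + 1):
        # n=n binds the current frequency to each lambda.
        basis.append(lambda x, n=n: math.sqrt(2 / L) * math.cos(2 * math.pi * n * x / L))
        basis.append(lambda x, n=n: math.sqrt(2 / L) * math.sin(2 * math.pi * n * x / L))
    return basis


def inner(f, g, L, nodes=2000):
    """Midpoint-rule approximation of the integral of f*g over [0, L]."""
    h = L / nodes
    return h * sum(f((j + 0.5) * h) * g((j + 0.5) * h) for j in range(nodes))


L = 3.0
basis = fourier_basis(L, n_max=3)
gram = [[inner(f, g, L) for g in basis] for f in basis]

# Largest deviation of the Gram matrix from the identity matrix.
max_dev = max(
    abs(gram[i][j] - (1.0 if i == j else 0.0))
    for i in range(len(basis))
    for j in range(len(basis))
)
print(f"largest deviation from the identity: {max_dev:.2e}")
```

A deviation close to machine precision indicates that the inner products $(e_i, e_j)$ are $1$ for $i = j$ and $0$ otherwise.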

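5.2.2 Fourier coefficients

We are now in position to apply Parseval's identity in order to determine the coefficients for a given function in $L^2([0,L])$. For the real case, we obtain $\alpha_0 \in \mathbb{R}$ and $\alpha_n, \beta_n \in \mathbb{R}$, for $n \in \mathbb{N}$, such that

$$f(x) \overset{L^2}{=} \alpha_0 \cdot \frac{1}{\sqrt{L}} + \sum_{n=1}^{\infty} \left( \alpha_n \cdot \sqrt{\frac{2}{L}} \cos\frac{2\pi n x}{L} + \beta_n \cdot \sqrt{\frac{2}{L}} \sin\frac{2\pi n x}{L} \right),$$

where these coefficients are determined by

$$\alpha_0 = \left( f, \tfrac{1}{\sqrt{L}} \right) = \frac{1}{\sqrt{L}} \int_0^L f(x) \, dx,$$

$$\alpha_n = \left( f, \sqrt{\tfrac{2}{L}} \cos\tfrac{2\pi n x}{L} \right) = \sqrt{\frac{2}{L}} \int_0^L f(x) \cos\frac{2\pi n x}{L} \, dx,$$

$$\beta_n = \left( f, \sqrt{\tfrac{2}{L}} \sin\tfrac{2\pi n x}{L} \right) = \sqrt{\frac{2}{L}} \int_0^L f(x) \sin\frac{2\pi n x}{L} \, dx.$$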

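Now a simple calculation gives an equivalent representation of $f$:

$$\begin{aligned} f(x) &\overset{L^2}{=} \alpha_0 \cdot \frac{1}{\sqrt{L}} + \sum_{n=1}^{\infty} \left( \alpha_n \cdot \sqrt{\frac{2}{L}} \cos\frac{2\pi n x}{L} + \beta_n \cdot \sqrt{\frac{2}{L}} \sin\frac{2\pi n x}{L} \right) \\ &= \frac{1}{L} \int_0^L f(x) \, dx + \frac{2}{L} \sum_{n=1}^{\infty} \left( \int_0^L f(x) \cos\frac{2\pi n x}{L} \, dx \right) \cdot \cos\frac{2\pi n x}{L} + \frac{2}{L} \sum_{n=1}^{\infty} \left( \int_0^L f(x) \sin\frac{2\pi n x}{L} \, dx \right) \cdot \sin\frac{2\pi n x}{L} \\ &= \frac{2}{L} \left( \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cdot \cos\frac{2\pi n x}{L} + b_n \cdot \sin\frac{2\pi n x}{L} \right) \right), \end{aligned}$$

where

$$a_0 := \int_0^L f(x) \, dx, \quad a_n := \int_0^L f(x) \cos\frac{2\pi n x}{L} \, dx, \quad b_n := \int_0^L f(x) \sin\frac{2\pi n x}{L} \, dx.$$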

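Since, for a given function $f$, these calculations are easier, in most text books for applied mathematics the above coefficients $a_0, a_n, b_n$ are called the Fourier coefficients of $f$ (instead of $\alpha_0, \alpha_n, \beta_n$), and the above series is called Fourier series (in $L^2([0,L], \mathbb{R})$).

In the complex case, the same argument applies. Namely, for $f \in L^2([0,L], \mathbb{C})$, with

$$f(x) \overset{L^2}{=} \sum_{n \in \mathbb{Z}} \gamma_n \cdot \frac{1}{\sqrt{L}} e^{\frac{2\pi n i x}{L}}, \quad \text{where}$$

$$\gamma_n := \left( f, \tfrac{1}{\sqrt{L}} e^{\frac{2\pi n i x}{L}} \right) = \int_0^L f(x) \cdot \overline{\frac{1}{\sqrt{L}} e^{\frac{2\pi n i x}{L}}} \, dx = \frac{1}{\sqrt{L}} \int_0^L f(x) \cdot \overline{e^{\frac{2\pi n i x}{L}}} \, dx,$$

we have

$$\begin{aligned} f(x) &\overset{L^2}{=} \sum_{n \in \mathbb{Z}} \gamma_n \cdot \frac{1}{\sqrt{L}} e^{\frac{2\pi n i x}{L}} = \sum_{n \in \mathbb{Z}} \left( \frac{1}{\sqrt{L}} \int_0^L f(x) \cdot \overline{e^{\frac{2\pi n i x}{L}}} \, dx \right) \cdot \frac{1}{\sqrt{L}} e^{\frac{2\pi n i x}{L}} \\ &= \frac{1}{L} \sum_{n \in \mathbb{Z}} \left( \int_0^L f(x) \cdot \overline{e^{\frac{2\pi n i x}{L}}} \, dx \right) \cdot e^{\frac{2\pi n i x}{L}} \\ &= \frac{1}{L} \sum_{n \in \mathbb{Z}} c_n \cdot e^{\frac{2\pi n i x}{L}}, \quad \text{where} \end{aligned}$$

$$c_n := \int_0^L f(x) \cdot \overline{e^{\frac{2\pi n i x}{L}}} \, dx = \int_0^L f(x) \cdot e^{-\frac{2\pi n i x}{L}} \, dx.$$

As above, the coefficients $c_n$ are called the Fourier coefficients of $f$ (instead of $\gamma_n$), and the above series is called Fourier series (in $L^2([0,L], \mathbb{C})$).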

Euler’s formula, and the relation between real and complex Fourier coefficients

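Recall that Euler's formula states, for $x \in \mathbb{R}$, that

$$e^{ix} = \cos x + i \sin x.$$

We now write $f \in L^2([0,L], \mathbb{R})$ as a Fourier series with coefficients $a_n, b_n$ in $L^2([0,L], \mathbb{R})$ and with coefficients $c_n$ in $L^2([0,L], \mathbb{C})$ as above. Hence, using Euler's formula,

$$f(x) \overset{L^2}{=} \frac{2}{L} \left( \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cdot \cos\frac{2\pi n x}{L} + b_n \cdot \sin\frac{2\pi n x}{L} \right) \right), \quad \text{and}$$

$$\begin{aligned} f(x) &\overset{L^2}{=} \frac{1}{L} \sum_{n \in \mathbb{Z}} c_n \cdot e^{\frac{2\pi n i x}{L}} = \frac{1}{L} \sum_{n \in \mathbb{Z}} c_n \cdot \left( \cos\frac{2\pi n x}{L} + i \sin\frac{2\pi n x}{L} \right) \\ &= \frac{c_0}{L} + \frac{1}{L} \sum_{n=1}^{\infty} \left( c_n \cdot \cos\frac{2\pi n x}{L} + c_{-n} \cdot \cos\frac{-2\pi n x}{L} \right) + \frac{1}{L} \sum_{n=1}^{\infty} \left( c_n \cdot i \sin\frac{2\pi n x}{L} + c_{-n} \cdot i \sin\frac{-2\pi n x}{L} \right) \\ &= \frac{c_0}{L} + \frac{1}{L} \sum_{n=1}^{\infty} \left( c_n \cdot \cos\frac{2\pi n x}{L} + c_{-n} \cdot \cos\frac{2\pi n x}{L} \right) + \frac{1}{L} \sum_{n=1}^{\infty} \left( c_n \cdot i \sin\frac{2\pi n x}{L} - c_{-n} \cdot i \sin\frac{2\pi n x}{L} \right). \end{aligned}$$

Since the Fourier coefficients for a given function $f \in L^2([0,L], \mathbb{R})$ are unique, it follows, for $n \in \mathbb{N}$, that

$$\frac{a_0}{L} = \frac{c_0}{L}, \quad \frac{2}{L} a_n = \frac{1}{L} (c_n + c_{-n}), \quad \frac{2}{L} b_n = \frac{1}{L} (i c_n - i c_{-n}).$$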

We have proven the following proposition.

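Proposition 5.2.4. For the Fourier coefficients $a_n, b_n$ with respect to $L^2([0,L], \mathbb{R})$ and $c_n$ with respect to $L^2([0,L], \mathbb{C})$ for a given function $f \in L^2([0,L], \mathbb{R})$, we have

$$a_0 = c_0, \quad a_n = \frac{1}{2} (c_n + c_{-n}), \quad b_n = \frac{1}{2} (i c_n - i c_{-n}).$$

Recall that with respect to the coefficients given by the orthonormal basis of Theorem 5.2.3, we have

$$\alpha_0 = \frac{1}{\sqrt{L}} a_0, \quad \alpha_n = \sqrt{\frac{2}{L}} a_n, \quad \beta_n = \sqrt{\frac{2}{L}} b_n, \quad \gamma_n = \frac{1}{\sqrt{L}} c_n.$$

Hence, in this case, we obtain the following relations.

Proposition 5.2.5. For the Fourier coefficients $\alpha_n, \beta_n$ and $\gamma_n$ with respect to the orthonormal basis of Theorem 5.2.3,

$$\alpha_0 = \gamma_0, \quad \alpha_n = \frac{1}{\sqrt{2}} (\gamma_n + \gamma_{-n}), \quad \beta_n = \frac{1}{\sqrt{2}} (i \gamma_n - i \gamma_{-n}).$$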

5.3 Fourier series for periodic functions

There is an immediate extension of the above considerations to periodic functions. Recall that a function $f : \mathbb{R} \to \mathbb{F}$ is called periodic with period $L$, if $f(x) = f(x+L)$ for all $x \in \mathbb{R}$. Moreover, note that

$$\int_y^{L+y} f(x) \, dx = \int_0^L f(x) \, dx$$

for all $y \in \mathbb{R}$. Since the elements of the Fourier basis for $L^2([0,L])$ are all periodic with period $L$, we immediately obtain that the following class of periodic functions admits a Fourier expansion.

Theorem 5.3.1. Assume that $f : \mathbb{R} \to \mathbb{F}$ is periodic with period $L$, and that

$$\int_y^{L+y} |f(x)|^2 \, dx < \infty$$

for some $y \in \mathbb{R}$. Then there exists an expansion of $f$ as Fourier series. That is, if $f$ is a real valued function (and periodic), then

$$f(x) \overset{L^2}{=} \frac{2}{L} \left( \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cdot \cos\frac{2\pi n x}{L} + b_n \cdot \sin\frac{2\pi n x}{L} \right) \right), \quad \text{where}$$

$$a_0 := \int_y^{y+L} f(x) \, dx, \quad a_n := \int_y^{y+L} f(x) \cos\frac{2\pi n x}{L} \, dx, \quad b_n := \int_y^{y+L} f(x) \sin\frac{2\pi n x}{L} \, dx,$$

for some $y \in \mathbb{R}$. Furthermore, if $f$ is complex valued, then

$$f(x) \overset{L^2}{=} \frac{1}{L} \sum_{n \in \mathbb{Z}} c_n \cdot e^{\frac{2\pi n i x}{L}}, \quad \text{where}$$

$$c_n := \int_y^{y+L} f(x) \cdot e^{-\frac{2\pi n i x}{L}} \, dx,$$

for some $y \in \mathbb{R}$.

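Remark. It remains to discuss what '$\overset{L^2}{=}$' means in this context. Recall that for $g, h \in L^2([a,b])$, $g \overset{L^2}{=} h$ is the abbreviated notation for

$$\int_a^b |g - h|^2 \, dx = 0.$$

So, in the above theorem, '$\overset{L^2}{=}$' means that, for some (and hence all) $y \in \mathbb{R}$,

$$\int_y^{y+L} \left( f - \frac{2}{L} \left( \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cdot \cos\frac{2\pi n x}{L} + b_n \cdot \sin\frac{2\pi n x}{L} \right) \right) \right)^2 dx = 0,$$

and

$$\int_y^{y+L} \left| f - \frac{1}{L} \sum_{n \in \mathbb{Z}} c_n \cdot e^{\frac{2\pi n i x}{L}} \right|^2 dx = 0,$$

respectively.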

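Example 50. We now give an example for an expansion in terms of the Fourier basis. So consider the function $f \in L^2([-\pi,\pi])$ given by

$$f(x) := \begin{cases} 0 & -\pi \le x < 0 \\ x & 0 \le x \le \pi. \end{cases}$$

The coefficients of $f$ with respect to the complex Fourier basis are now given by the following calculations. For $n = 0$, we have

$$c_0 = \int_{-\pi}^{\pi} f(x) \cdot 1 \, dx = \int_{-\pi}^{0} 0 \, dx + \int_0^{\pi} x \, dx = \frac{\pi^2}{2}.$$

For $n \neq 0$, using partial integration and $e^{n\pi i} = (-1)^n$, it follows that

$$\begin{aligned} c_n &= \int_{-\pi}^{\pi} f(x) \cdot \overline{e^{inx}} \, dx = \int_0^{\pi} x \cdot e^{-inx} \, dx \\ &= \left. \frac{x}{-in} e^{-inx} \right|_0^{\pi} - \int_0^{\pi} \frac{1}{-in} e^{-inx} \, dx = \frac{\pi}{-in} e^{-in\pi} - \left. \frac{1}{(-in)^2} e^{-inx} \right|_0^{\pi} \\ &= \frac{\pi}{-in} e^{-in\pi} - \frac{1}{(-in)^2} (e^{-in\pi} - 1) = \frac{i\pi}{n} e^{-in\pi} + \frac{1}{n^2} (e^{-in\pi} - 1) \\ &= \frac{i\pi (-1)^n}{n} - \frac{1 - (-1)^n}{n^2} = \begin{cases} \frac{i\pi}{n} & n \text{ even} \\ -\frac{2}{n^2} - \frac{i\pi}{n} & n \text{ odd.} \end{cases} \end{aligned}$$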

Figure 5.1: Approximations for n = 2 (green), 5 (red), 15 (blue).

Using Proposition 5.2.4, we hence obtain the coefficients with respect to the expansion with respect to $L^2([-\pi,\pi], \mathbb{R})$, that

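$$a_0 = c_0 = \frac{\pi^2}{2},$$

$$a_n = \frac{1}{2} (c_n + c_{-n}) = \frac{1}{2} \left( \frac{i\pi (-1)^n}{n} - \frac{1 - (-1)^n}{n^2} + \frac{i\pi (-1)^{-n}}{-n} - \frac{1 - (-1)^{-n}}{n^2} \right) = \frac{(-1)^n - 1}{n^2} = \begin{cases} 0 & n \text{ even} \\ -\frac{2}{n^2} & n \text{ odd,} \end{cases}$$

$$b_n = \frac{1}{2} (i c_n - i c_{-n}) = \frac{1}{2} \left( \frac{-\pi (-1)^n}{n} - i \frac{1 - (-1)^n}{n^2} - \frac{-\pi (-1)^{-n}}{-n} + i \frac{1 - (-1)^{-n}}{n^2} \right) = -\frac{\pi (-1)^n}{n} = \begin{cases} -\frac{\pi}{n} & n \text{ even} \\ \frac{\pi}{n} & n \text{ odd.} \end{cases}$$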

The approximations of the Fourier series are depicted in Figure 5.1.
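The coefficients of Example 50 can be cross-checked numerically, together with the relations of Proposition 5.2.4. The following Python sketch is an illustration only: the midpoint rule with 20000 nodes, the function names and the evaluation point $x = \pi/2$ are choices made here.

```python
import cmath
import math


def f(x):
    """The function of Example 50 on [-pi, pi]."""
    return x if x >= 0 else 0.0


def c_numeric(n, nodes=20000):
    """Midpoint-rule approximation of c_n = integral of f(x) e^{-inx} over [-pi, pi]."""
    h = 2 * math.pi / nodes
    total = 0j
    for j in range(nodes):
        x = -math.pi + (j + 0.5) * h
        total += f(x) * cmath.exp(-1j * n * x)
    return total * h


def c_closed(n):
    """Closed form of c_n derived in Example 50."""
    if n == 0:
        return math.pi ** 2 / 2
    sign = (-1) ** abs(n)  # (-1)^n = (-1)^{-n}
    return 1j * math.pi * sign / n - (1 - sign) / n ** 2


def a(n):
    """a_n = (c_n + c_{-n}) / 2 for n >= 1, and a_0 = c_0 (Proposition 5.2.4)."""
    if n == 0:
        return c_closed(0)
    return ((c_closed(n) + c_closed(-n)) / 2).real


def b(n):
    """b_n = (i c_n - i c_{-n}) / 2 (Proposition 5.2.4)."""
    return ((1j * c_closed(n) - 1j * c_closed(-n)) / 2).real


def partial_sum(x, N):
    """N-th partial sum of the real Fourier series with L = 2*pi."""
    L = 2 * math.pi
    series = a(0) / 2 + sum(a(n) * math.cos(n * x) + b(n) * math.sin(n * x)
                            for n in range(1, N + 1))
    return 2 / L * series


max_error = max(abs(c_numeric(n) - c_closed(n)) for n in range(-5, 6))
print(f"largest deviation between quadrature and closed form: {max_error:.2e}")
print(f"partial sum at pi/2 with N = 200: {partial_sum(math.pi / 2, 200):.4f}")
```

The partial sums evaluated at a point of continuity such as $x = \pi/2$ should approach $f(\pi/2) = \pi/2$, which matches the behaviour visible in Figure 5.1.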

5.3.1 Function classes with stronger convergence properties

For a given function $f : \mathbb{R} \to \mathbb{C}$, the question is still open for which values $x \in \mathbb{R}$ the Fourier series converges to $f(x)$. This problem is illustrated for the function given by

$$f(x) = \begin{cases} 1 & x \in \mathbb{R} \setminus \mathbb{Q} \\ 0 & x \in \mathbb{Q}. \end{cases}$$

Since the values on a countable set do not change the value of the integral, the Fourier series is equal to 1 everywhere. Hence, the Fourier series gives the "wrong" value for each $x \in \mathbb{Q}$. We now state the following theorems without proof, which provide a priori knowledge about the pointwise convergence of the Fourier series.

Theorem 5.3.2 (Dirichlet's condition). Assume that $f$ is a periodic function with period $L$ such that in the interval $[0,L]$,

(i) $f$ is continuous up to at most finitely many discontinuities,

(ii) in each interval of continuity, there are only finitely many maxima and minima,

(iii) $\int_0^L |f(x)| \, dx < \infty$.

Then the Fourier series converges at each point of continuity $x$ to $f(x)$, and if $f$ is discontinuous at $x_0$, then

$$\frac{1}{L} \sum_{n \in \mathbb{Z}} c_n \cdot e^{\frac{2\pi n i x_0}{L}} = \frac{1}{2} \left( \lim_{x \to x_0+} f(x) + \lim_{x \to x_0-} f(x) \right).$$

In terms of the different types of convergence, the theorem implies that for the subspace of periodic continuous functions of period $L$ with finitely many maxima and minima in each period, we have pointwise convergence and convergence in $L^2$. Furthermore, the next theorem tells us for which cases we even have uniform convergence.

Theorem 5.3.3. Assume that $f$ is a continuous periodic function with period $L$ such that $f$ is piecewise continuously differentiable with respect to a finite partition of $[0,L]$. Then the Fourier series converges uniformly to $f$.

Remark. The condition that $f$ is piecewise continuously differentiable with respect to a finite partition of $[0,L]$ means that there exist $0 = t_0 < t_1 < \ldots < t_k = L$ such that $f$ restricted to $[t_{i-1}, t_i]$ is differentiable with continuous derivative for all $i = 1, \ldots, k$.

5.3.2 Fourier coefficients of even and odd functions

For a given function, the following easy observations are sometimes helpful in order to determine the Fourier expansion of $f$.

Proposition 5.3.4. Let $f : [-L/2, L/2] \to \mathbb{R}$ be a function in $L^2([-L/2, L/2])$.

(i) If $f$ is even (i.e. $f(-x) = f(x)$), then the $b_n$ are all equal to 0. Hence the Fourier expansion is given by

$$\frac{2}{L} \left( \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n \cdot \cos\frac{2\pi n x}{L} \right).$$

(ii) If $f$ is odd (i.e. $f(-x) = -f(x)$), then the $a_n$ ($n \ge 0$) are equal to 0. Hence the Fourier expansion is given by

$$\frac{2}{L} \left( \sum_{n=1}^{\infty} b_n \cdot \sin\frac{2\pi n x}{L} \right).$$

(iii) If $f - f(0)$ is odd, then the $a_n$ ($n \ge 1$) are equal to 0, and $a_0 = L f(0)$. Hence the Fourier expansion is given by

$$\frac{2}{L} \left( \frac{L f(0)}{2} + \sum_{n=1}^{\infty} b_n \cdot \sin\frac{2\pi n x}{L} \right).$$

The proof of this proposition relies on the fact that for an odd and integrable function $f : [-L/2, L/2] \to \mathbb{R}$, we have, using the substitution $x = -y$,

$$\begin{aligned} \int_{-L/2}^{L/2} f(x) \, dx &= \int_{-L/2}^{0} f(x) \, dx + \int_0^{L/2} f(x) \, dx = \int_0^{L/2} f(x) \, dx - \int_{-L/2}^{0} f(-x) \, dx \\ &= \int_0^{L/2} f(x) \, dx - \int_0^{L/2} f(y) \, dy = 0. \end{aligned}$$

Finally note that for an even function $f$, the function $f(x) \sin(2\pi n x/L)$ is an odd function, and if $f$ is odd, then $f(x) \cos(2\pi n x/L)$ is also odd. With respect to periodic functions, the above result reads as follows.

Proposition 5.3.5. Let $f : \mathbb{R} \to \mathbb{R}$ be a periodic function of period $L$ such that $\int_{-L/2}^{L/2} f^2(x) \, dx < \infty$.

(i) If $f$ is even, then the $b_n$ are all equal to 0.

(ii) If $f$ is odd, then the $a_n$ ($n \ge 0$) are equal to 0.

In particular, it is sometimes useful, to expand a given function f : [0,L]→ R to a periodicfunction which is either even or odd. This can be done in several ways.

Expansion to an even function. Let g : [−L,L] → R be given by

\[ g(x) := \begin{cases} f(x) & x \in [0,L] \\ f(-x) & x \in [-L,0). \end{cases} \]

Then the periodic extension of g to a function defined on R is an even function.

Expansion to an odd function. Let g : [−L,L] → R be given by

\[ g(x) := \begin{cases} f(x) & x \in (0,L] \\ 0 & x = 0 \\ -f(-x) & x \in [-L,0). \end{cases} \]

Then the periodic extension of g to a function defined on R is an odd function.

5.3.3 Expanding functions to continuous periodic functions

By Theorems 5.3.2 and 5.3.3, we know that the Fourier series of periodic and continuous functions have better convergence properties. So assume that we want to determine a Fourier expansion for a given continuous function f : [0,L] → C. In order to obtain good convergence properties as a consequence of Theorems 5.3.2 or 5.3.3, one simply extends the function f : [0,L] → C to a continuous function g : [0,L′] → C for some L′ > L with f(x) = g(x) for all x ∈ [0,L], and g(0) = g(L′). Then the periodic extension of g is continuous, and the Fourier series of g converges to f(x) for all x ∈ [0,L] (provided that the assumptions of Theorem 5.3.2 remain true). Furthermore, if the stronger conditions of Theorem 5.3.3 apply to g, then the convergence is even uniform. We will illustrate this in the following example.

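Example 51. Let f be given as in Example 50, that is f ∈ L2([−π,π]),

\[ f(x) := \begin{cases} 0 & -\pi \le x < 0 \\ x & 0 \le x \le \pi. \end{cases} \]

The function g : [−π,2π] → R is now defined by

\[ g(x) := \begin{cases} 0 & -\pi \le x < 0 \\ x & 0 \le x \le \pi \\ 2\pi - x & \pi \le x \le 2\pi. \end{cases} \]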

The Fourier coefficients are now determined by the standard procedure.

The coefficient $c_0$.

\[ c_0 = \int_{-\pi}^{2\pi} g(x)\,dx = \int_0^{\pi} x\,dx + \int_{\pi}^{2\pi} (2\pi - x)\,dx = \frac{\pi^2}{2} + 2\pi^2 - 2\pi^2 + \frac{\pi^2}{2} = \pi^2. \]

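The coefficients $c_k$ for k ≠ 0 (using partial integration).

\[
\begin{aligned}
c_k &= \int_{-\pi}^{2\pi} g(x)\, e^{-\frac{2ki}{3}x}\,dx
= \int_0^{\pi} x\, e^{-\frac{2ki}{3}x}\,dx + \int_{\pi}^{2\pi} (2\pi - x)\, e^{-\frac{2ki}{3}x}\,dx \\
&= \left.\frac{x\, e^{-\frac{2ki}{3}x}}{-\frac{2ki}{3}}\right|_0^{\pi} - \int_0^{\pi} \frac{e^{-\frac{2ki}{3}x}}{-\frac{2ki}{3}}\,dx + \left.\frac{(2\pi - x)\, e^{-\frac{2ki}{3}x}}{-\frac{2ki}{3}}\right|_{\pi}^{2\pi} - \int_{\pi}^{2\pi} \frac{e^{-\frac{2ki}{3}x}}{\frac{2ki}{3}}\,dx \\
&= \frac{\pi e^{-\frac{2ki}{3}\pi}}{-\frac{2ki}{3}} - \frac{e^{-\frac{2ki}{3}\pi} - 1}{\left(\frac{2ki}{3}\right)^2} - \frac{\pi e^{-\frac{2ki}{3}\pi}}{-\frac{2ki}{3}} + \frac{e^{-\frac{4ki}{3}\pi} - e^{-\frac{2ki}{3}\pi}}{\left(\frac{2ki}{3}\right)^2} \\
&= \frac{1}{\left(\frac{2ki}{3}\right)^2}\left(e^{-\frac{4ki}{3}\pi} - 2e^{-\frac{2ki}{3}\pi} + 1\right) = -\frac{9}{4k^2}\left(e^{-\frac{2ki}{3}\pi} - 1\right)^2 \\
&= -\frac{9}{4k^2}\, e^{-\frac{2ki}{3}\pi}\left(e^{-\frac{ki}{3}\pi} - e^{\frac{ki}{3}\pi}\right)^2 = \frac{9}{k^2}\sin^2\left(\frac{k\pi}{3}\right) e^{-\frac{2ki}{3}\pi}.
\end{aligned}
\]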

Note that $|c_k| \le 9k^{-2}$, that is, the coefficients decay with speed $k^{-2}$, in contrast to Example 50, where the coefficients decreased with speed $k^{-1}$. Hence, the Fourier series of g has fewer high-frequency parts, and in particular converges much faster. For convenience, the real coefficients are also given.

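The coefficient $a_0 = c_0 = \pi^2$.

The coefficients $a_k$ for k ∈ N.

\[ a_k = \frac{1}{2}(c_k + c_{-k}) = \frac{9}{2k^2}\sin^2\left(\frac{k\pi}{3}\right)\left(e^{-\frac{2ki}{3}\pi} + e^{\frac{2ki}{3}\pi}\right) = \frac{9}{k^2}\sin^2\left(\frac{k\pi}{3}\right)\cos\left(\frac{2k\pi}{3}\right). \]

The coefficients $b_k$ for k ∈ N.

\[ b_k = \frac{i}{2}(c_k - c_{-k}) = \frac{9i}{2k^2}\sin^2\left(\frac{k\pi}{3}\right)\left(e^{-\frac{2ki}{3}\pi} - e^{\frac{2ki}{3}\pi}\right) = \frac{9i}{2k^2}\sin^2\left(\frac{k\pi}{3}\right) 2i\sin\left(-\frac{2k\pi}{3}\right) = \frac{9}{k^2}\sin^2\left(\frac{k\pi}{3}\right)\sin\left(\frac{2k\pi}{3}\right). \]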

Figure 5.2: Approximations for n = 2 (green), 5 (red), 15 (blue), and the 15th approximation from Example 50 (dotted).

Approximations 2, 5, 15 for the function g are depicted in Figure 5.2. Moreover, the 15th approximation from Example 50 is also plotted in order to compare the speed of approximation.
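The closed form of the coefficients in Example 51 can be cross-checked against a direct numerical integration. The sketch below is only an illustration: the function names `c_numeric`, `c_closed` and `partial_sum` are hypothetical, the midpoint rule is an arbitrary choice, and the partial sum assumes the normalisation g(x) = (1/(3π)) Σ c_k e^{2kix/3} for the period 3π.

```python
import cmath
import math

def g(x):
    """The extended function g of Example 51 on [-pi, 2*pi]."""
    if x < 0:
        return 0.0
    if x <= math.pi:
        return x
    return 2 * math.pi - x

def c_numeric(k, steps=30000):
    """Midpoint-rule approximation of the integral of g(x) e^{-2kix/3} over [-pi, 2*pi]."""
    a, b = -math.pi, 2 * math.pi
    h = (b - a) / steps
    total = 0j
    for j in range(steps):
        x = a + (j + 0.5) * h
        total += g(x) * cmath.exp(-2j * k * x / 3) * h
    return total

def c_closed(k):
    """Closed form derived in Example 51 (c_0 = pi^2)."""
    if k == 0:
        return complex(math.pi ** 2)
    return 9 / k ** 2 * math.sin(k * math.pi / 3) ** 2 * cmath.exp(-2j * k * math.pi / 3)

def partial_sum(x, n):
    """n-th Fourier approximation of g at x, built from the closed-form coefficients."""
    s = sum(c_closed(k) * cmath.exp(2j * k * x / 3) for k in range(-n, n + 1))
    return (s / (3 * math.pi)).real

for k in (1, 2, 4, 5):
    print(k, abs(c_numeric(k) - c_closed(k)) < 1e-4)
```

Since the coefficients are bounded by 9/k², the error of `partial_sum(x, n)` is at most of order 1/n uniformly in x, which is the faster convergence mentioned in the text.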

5.3.4 An application of Fourier series to partial differential equations

The expansion of a periodic function into a Fourier series is a powerful tool for solving differential equations, since the Fourier basis consists of eigenfunctions of the differential operator. Here, we will present a solution of the heat equation on a ring. Note that the unit circle as a subset of C can be parametrized as follows.

\[ S^1 := \{ z \in \mathbb{C} : |z| = 1 \} = \{ e^{i\theta} : \theta \in [0,2\pi] \}. \]

Hence, in order to specify an initial heat distribution on $S^1$, one has to fix some function $f_0 : [0,2\pi] \to \mathbb{R}$ with $f_0(0) = f_0(2\pi)$, and identify the temperature at $e^{i\theta} \in S^1$ with $f_0(\theta)$. The change in time of this distribution is now given by a solution f : [0,2π] × [0,∞) → R of the partial differential equation

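\[ \frac{\partial^2 f(\theta,t)}{\partial\theta^2} = \frac{\partial f(\theta,t)}{\partial t}. \]

So assume that the Fourier series of f with respect to θ exists for every t, that is, there exist functions $c_n : [0,\infty) \to \mathbb{C}$ such that

\[ f(\theta,t) = \sum_{n\in\mathbb{Z}} c_n(t)\, e^{in\theta}. \]

Furthermore, assume that we may interchange summation and differentiation. Then

\[ \frac{\partial f(\theta,t)}{\partial t} = \sum_{n\in\mathbb{Z}} \frac{\partial c_n(t)}{\partial t}\, e^{in\theta}, \quad\text{and}\quad \frac{\partial^2 f(\theta,t)}{\partial\theta^2} = \sum_{n\in\mathbb{Z},\, n\neq 0} (in)^2 c_n(t)\, e^{in\theta}. \]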

Since the expansion in Fourier series is a unitary operator from L2([0,2π]) to l2(Z), we find a solution if and only if $c_0(t)$ is a constant function, and

\[ \frac{\partial c_n(t)}{\partial t} = (in)^2 c_n(t) = -n^2 c_n(t), \]

which is an ordinary differential equation for all n ∈ Z, n ≠ 0. It follows that

\[ c_n(t) = c_n^{(0)} e^{-n^2 t}, \]

where $c_n^{(0)}$ denotes the n-th Fourier coefficient of $f_0$.

Figure 5.3: Graphs of the solutions of the heat equation for t = 0, 1/8, 1/4, 1/2, 1, 2.

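Example 52. Let the initial heat distribution be given by

\[ f_0(\theta) = \begin{cases} \theta & 0 \le \theta < \pi \\ 2\pi - \theta & \pi \le \theta \le 2\pi. \end{cases} \]

The Fourier expansion of this function is given by

\[ f_0(\theta) = \frac{\pi}{2} - \frac{4}{\pi}\left(\cos(\theta) + \frac{\cos(3\theta)}{3^2} + \frac{\cos(5\theta)}{5^2} + \cdots\right). \]

Hence, the solution of the heat equation is given by

\[ f(\theta,t) = \frac{\pi}{2} - \frac{4}{\pi}\left(e^{-t}\cos(\theta) + e^{-9t}\frac{\cos(3\theta)}{3^2} + e^{-25t}\frac{\cos(5\theta)}{5^2} + \cdots\right). \]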

For the graph of f (θ , t), see Figure 5.3.
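The series of Example 52 is easy to evaluate numerically by truncation. The following sketch is an illustration only; the function name `heat_ring` and the default number of terms are assumptions made here.

```python
import math

def heat_ring(theta, t, terms=200):
    """Truncated series solution of Example 52:
    pi/2 - (4/pi) * sum over odd n of exp(-n^2 t) cos(n theta) / n^2."""
    total = math.pi / 2
    for m in range(terms):
        n = 2 * m + 1                      # only odd frequencies occur
        total -= (4 / math.pi) * math.exp(-n * n * t) * math.cos(n * theta) / (n * n)
    return total

# temperature profile at a few times, sampled at theta = 0, pi/2, pi
for t in (0.0, 0.125, 0.5, 2.0):
    print(t, [round(heat_ring(th, t), 3) for th in (0.0, math.pi / 2, math.pi)])
```

For t = 0 the truncated sum approximates the initial tent function, while for growing t all non-constant modes are damped by the factor e^{-n²t} and the profile flattens towards the mean value π/2.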

5.4 The Fourier transform

The expansion in terms of a Fourier series might be seen as a unitary operator from L2([a,b],C) to the following Hilbert space l2(C),⁶ which is defined, for F = R or F = C, by

\[ l^2(F) := \Big\{ (c_n : n \in \mathbb{Z}) : c_n \in F,\ \sum_{n\in\mathbb{Z}} |c_n|^2 < \infty \Big\}, \qquad ((b_n),(c_n)) := \sum_{n\in\mathbb{Z}} b_n \overline{c_n}. \]

⁶ This also applies to L2([a,b],R) and l2(R).

The associated norm $\sqrt{((c_n),(c_n))}$ will also be written as $\|(c_n)\|_2$. So denote by $\hat f(n)$ the n-th Fourier coefficient (with respect to the complex orthonormal basis). Then, for

\[ \Phi : L^2([a,b],\mathbb{C}) \to l^2(\mathbb{C}), \qquad f \mapsto (\hat f(n) : n \in \mathbb{Z}), \]

it follows by Parseval's identity that $\|\Phi(f)\|_2 = \|f\|_2$. Moreover, Φ is a linear (and bounded)⁷ operator. Moreover, since $\{(b-a)^{-1/2}\exp(2\pi i n x/(b-a)) : n \in \mathbb{Z}\}$ is an orthonormal basis of L2([a,b],C), we obtain that

\[ (f,g) = ((\hat f(n)),(\hat g(n))) = (\Phi(f),\Phi(g)). \]

Hence, Φ is a unitary operator between infinite dimensional Hilbert spaces! Informally, the Fourier transform is the continuous analogue of the expansion in a Fourier series. In particular, it is a unitary operator from L2(R) to L2(R), where the sequence $(\hat f(n))$ in the case of a Fourier series is replaced by a function $\hat f : \mathbb{R} \to \mathbb{C}$.

5.4.1 The Dirichlet and Fejer Kernel

The Dirichlet Kernel

In order to motivate the definition of a convolution, we will look at the Fourier series of a given function. So, for N ∈ N, and f ∈ L2([a,b]), we have

\[ f(t) = \lim_{N\to\infty} \frac{1}{b-a} \sum_{k=-N}^{N} \left( \int_a^b f(s)\, e^{-\frac{2\pi i k s}{b-a}}\,ds \right) e^{\frac{2\pi i k t}{b-a}}. \]

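The sum in this limit can now be rewritten as follows.

\[ \sum_{k=-N}^{N} \left( \int_a^b f(s)\, e^{-\frac{2\pi i k s}{b-a}}\,ds \right) e^{\frac{2\pi i k t}{b-a}} = \sum_{k=-N}^{N} \left( \int_a^b f(s)\, e^{\frac{2\pi i k (t-s)}{b-a}}\,ds \right) = \int_a^b \sum_{k=-N}^{N} f(s)\, e^{\frac{2\pi i k (t-s)}{b-a}}\,ds. \]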

So let

\[ D_N(\omega) := \sum_{k=-N}^{N} e^{ik\omega}. \]

So the N-th approximation of the function f can be rewritten as

\[ \frac{1}{b-a} \int_a^b f(s)\, D_N\big(2\pi(t-s)/(b-a)\big)\,ds. \]

⁷ A linear operator A is called bounded if sup ‖A(x)‖/‖x‖ < ∞, which is equivalent to the continuity of the operator.

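Note that $D_N$ is called the N-th Dirichlet kernel, and has the following properties. Using the identity $(1 - z)(1 + z + z^2 + \cdots + z^n) = 1 - z^{n+1}$, it follows that

\[
\begin{aligned}
\sum_{k=-N}^{N} e^{ki\omega} &= e^{-Ni\omega}\, \frac{1 - e^{(2N+1)i\omega}}{1 - e^{i\omega}} = \frac{e^{-Ni\omega} - e^{(N+1)i\omega}}{1 - e^{i\omega}} \\
&= \frac{e^{-((2N+1)i\omega)/2} - e^{((2N+1)i\omega)/2}}{e^{-i\omega/2} - e^{i\omega/2}} = \frac{\sin\left(\frac{(2N+1)\omega}{2}\right)}{\sin\left(\frac{\omega}{2}\right)}.
\end{aligned}
\]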

So for N ∈ N, note that the denominator is equal to zero if and only if ω = 2kπ for some k ∈ Z. For these elements, using l'Hopital's rule, one obtains

\[ D_N(2k\pi) = \left. \frac{\partial \sin((N+1/2)\omega)/\partial\omega}{\partial \sin(\omega/2)/\partial\omega} \right|_{2k\pi} = 2N + 1. \]

Figure 5.4: The Dirichlet kernel D4 (black), D8 (green), and the limiting shape (red).

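Furthermore, note that

\[ D_N(\omega) \cdot \sin(\omega/2) = \sin((N+1/2)\omega) \in [-1,1]. \]

As a consequence of this easy observation, it follows that

\[ -\frac{1}{|\sin(\omega/2)|} \le D_N(\omega) \le \frac{1}{|\sin(\omega/2)|}, \]

where we have equality on one side for $(N+1/2)\omega \in \pi/2 + \pi\mathbb{Z}$, or equivalently, for

\[ \omega \in \left\{ \frac{\pi + 2k\pi}{2N+1} : k \in \mathbb{Z} \right\}. \]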

So the number of oscillations of $D_N$ grows linearly with N (see Figure 5.4). In particular, it follows that the Dirichlet kernel does not converge to a so-called δ-function, an object which will be discussed later.

Figure 5.5: The Fejer kernel F4 (black), F8 (blue) with limiting shapes, and the limiting shape (red) of the Dirichlet kernels.

By passing to the so-called Cesaro mean, it is possible to obtain an associated kernel which has this property. The Fejer kernel is defined as follows (see Figure 5.5 for the graphs of F4 and F8).

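\[ F_N(\omega) := \frac{1}{N} \sum_{k=0}^{N-1} D_k(\omega). \]

By similar considerations as above, one can show that

\[ F_N(\omega) = \frac{1}{N} \left( \frac{\sin\left(\frac{N\omega}{2}\right)}{\sin\left(\frac{\omega}{2}\right)} \right)^2. \]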

Hence, again using l'Hopital's rule, $F_N(2k\pi) = N^2/N = N$ for all k ∈ Z. Furthermore, for ω ∉ 2πZ, it follows again by the same argument as above that

\[ 0 \le F_N(\omega) \le \frac{1}{N (\sin(\omega/2))^2}. \]

Hence, $F_N$ is periodic of period 2π, and for ω ∈ [−π,π], we have

\[ \lim_{N\to\infty} F_N(\omega) = \begin{cases} 0 & \omega \in [-\pi,\pi],\ \omega \neq 0 \\ \infty & \omega = 0. \end{cases} \]
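The closed forms of both kernels can be compared with their defining sums. The sketch below is an illustration with hypothetical function names; it assumes that the Cesaro mean is taken over k = 0, ..., N−1, which is the range for which the squared-sine formula and the value F_N(0) = N hold.

```python
import math

def dirichlet_sum(N, w):
    """D_N(w) as the defining sum; the imaginary parts of e^{ikw} and e^{-ikw} cancel."""
    return 1.0 + 2.0 * sum(math.cos(k * w) for k in range(1, N + 1))

def dirichlet_closed(N, w):
    """Closed form sin((N + 1/2) w) / sin(w / 2), valid for w outside 2*pi*Z."""
    return math.sin((N + 0.5) * w) / math.sin(w / 2)

def fejer_mean(N, w):
    """Cesaro mean (1/N) * (D_0(w) + ... + D_{N-1}(w))."""
    return sum(dirichlet_sum(k, w) for k in range(N)) / N

def fejer_closed(N, w):
    """Closed form (1/N) * (sin(N w / 2) / sin(w / 2))^2, valid for w outside 2*pi*Z."""
    return (math.sin(N * w / 2) / math.sin(w / 2)) ** 2 / N

for N in (4, 8):
    for w in (0.3, 1.0, 2.5):
        print(N, w,
              abs(dirichlet_sum(N, w) - dirichlet_closed(N, w)) < 1e-9,
              abs(fejer_mean(N, w) - fejer_closed(N, w)) < 1e-9)
```

Evaluating `fejer_closed(N, w)` for a fixed w ≠ 0 and growing N shows the decay of order 1/N, whereas `dirichlet_closed(N, w)` keeps oscillating between the bounds ±1/|sin(w/2)|.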

5.4.2 Convolutions

Definition 5.4.1. Let f, g be functions from A to C, where A is either R or an interval [a,b] (a,b ∈ R). If

\[ f * g : \mathbb{R} \to \mathbb{C}, \qquad x \mapsto \int_{-\infty}^{\infty} f(s)\, g(x-s)\,ds \]

defines a function from A to C, then f ∗ g is called the convolution of f and g.

With respect to this notion, it follows that the N-th approximation of a function by its Fourier series can be written as

\[ \frac{1}{b-a}\, f * D_N(2\pi x/(b-a)). \]

From now on, let A = R. In order to see in which cases the convolution is defined, let $L^1_{bc}(\mathbb{R})$ be the normed vector space given by

\[ L^1_{bc}(\mathbb{R}) := \Big\{ f : \mathbb{R} \to \mathbb{C} : f \text{ is bounded and continuous, and } \int |f(x)|\,dx < \infty \Big\}, \qquad \|f\|_1 := \int |f(x)|\,dx. \]

For functions from this space, the convolution is always defined.

Theorem 5.4.2. For $f, g, h \in L^1_{bc}(\mathbb{R})$, we have

(i) $f * g \in L^1_{bc}(\mathbb{R})$,

(ii) f ∗g = g∗ f ,

(iii) f ∗ (g∗h) = ( f ∗g)∗h,

(iv) f ∗ (g+h) = f ∗g+ f ∗h.

Proof. In order to prove (i), fix C > 0 such that |g(x)| ≤ C for all x ∈ R (which is possible since g is bounded). Since

\[ \left| \int f(s)\, g(t-s)\,ds \right| \le \int |f(s)|\,|g(t-s)|\,ds \le C \|f\|_1 < \infty, \]

the convolution f ∗ g is defined, ‖f ∗ g‖ < ∞, and f ∗ g is bounded. It hence remains to show that f ∗ g is continuous. So assume first that g is uniformly continuous. That is, for all ε > 0 there exists δ > 0 such that

\[ |g(x) - g(y)| < \varepsilon \quad\text{for all } x, y \text{ with } |x - y| < \delta. \]

For $x_0 \in \mathbb{R}$, and x ∈ R with $|x - x_0| < \delta$,

\[
\begin{aligned}
|f*g(x_0) - f*g(x)| &= \left| \int f(s)\, g(x_0 - s)\,ds - \int f(s)\, g(x - s)\,ds \right| = \left| \int f(s)\,\big(g(x_0 - s) - g(x - s)\big)\,ds \right| \\
&\le \int |f(s)|\,\big|g(x_0 - s) - g(x - s)\big|\,ds \le \|f\|_1\, \varepsilon.
\end{aligned}
\]

By choosing a sequence $\varepsilon_n \searrow 0$ with associated $\delta_n$, it follows that $\lim_{x\to x_0} f*g(x) = f*g(x_0)$, and hence the convolution is continuous. The general case, in which g is only continuous, follows from this as follows. For $x_0 \in \mathbb{R}$, choose $T > 2|x_0|$ such that

\[ \int_{|s|>T} |f(s)|\,ds < \varepsilon. \]

Then g is uniformly continuous on [−T,T]. Furthermore, for $|x - x_0| < \delta$ (where δ is chosen according to the uniform continuity of g on [−T,T]),

\[
\begin{aligned}
|f*g(x_0) - f*g(x)| &= \left| \int f(s)\, g(x_0 - s)\,ds - \int f(s)\, g(x - s)\,ds \right| \le \int |f(s)|\,\big|g(x_0 - s) - g(x - s)\big|\,ds \\
&\le \int_{-T}^{T} |f(s)|\,\big|g(x_0 - s) - g(x - s)\big|\,ds + 2C\varepsilon \\
&\le \|f\|_1\, \varepsilon + 2C\varepsilon.
\end{aligned}
\]

The proofs of the remaining assertions are easier, and omitted.

5.4.3 The Fourier transform

The Fourier transform of a function f : R → C might be seen as a continuous analogue of the expansion in terms of a Fourier basis: the summation is replaced by its continuous analogue, namely by integration. The Fourier transform of f is also a function $\hat f : \mathbb{R} \to \mathbb{C}$, which is defined by

\[ \hat f(y) := \int f(x)\, e^{-ixy}\,dx, \]

provided $\hat f(y)$ exists for (almost⁸) all y ∈ R.

Proposition 5.4.3. Assume that $f, g \in L^1_{bc}(\mathbb{R})$. Then

(i) The Fourier transform $\hat f$ exists, and is continuous and bounded.

(ii) For $g(x) := f(x)e^{iax}$, we have $\hat g(y) = \hat f(y-a)$.

(iii) For $g(x) := f(x-a)$, we have $\hat g(y) = \hat f(y)e^{-iay}$.

(iv) For $h := f * g$, we have $\hat h(y) = \hat f(y)\hat g(y)$.

(v) For λ > 0 and $g(x) := f(x/\lambda)$, we have $\hat g(y) = \lambda \hat f(\lambda y)$.

(vi) For f with $f' \in L^1_{bc}(\mathbb{R})$, we have $\widehat{f'}(y) = iy \hat f(y)$.

⁸ In fact, one only requires that $\hat f(y)$ exists except for a set of points y of measure 0.

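Proof. (i) For y ∈ R, we have that

\[ |\hat f(y)| = \left| \int f(x)\, e^{-ixy}\,dx \right| \le \int \big| f(x)\, e^{-ixy} \big|\,dx \le \int |f(x)|\, \big| e^{-ixy} \big|\,dx \le \|f\|_1. \]

For h > 0, we have

\[
\begin{aligned}
|\hat f(y) - \hat f(y+h)| &= \left| \int f(x)\, e^{-ixy}\,dx - \int f(x)\, e^{-ix(y+h)}\,dx \right| = \left| \int f(x)\, e^{-ixy} \big(1 - e^{-ixh}\big)\,dx \right| \\
&\le \int \big| f(x)\, e^{-ixy} \big|\, \big| 1 - e^{-ixh} \big|\,dx \\
&\le \sup_{x : f(x) \neq 0} \big| 1 - e^{-ixh} \big|\, \|f\|_1.
\end{aligned}
\]

Since $\{x : f(x) \neq 0\}$ is a bounded subset of R, it follows that

\[ \lim_{h\to 0}\ \sup_{x : f(x)\neq 0} \big| 1 - e^{-ixh} \big| = 0. \]

In particular, $\hat f$ is bounded and continuous.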

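(ii) This is a straightforward calculation:

\[ \hat g(y) = \int f(x)\, e^{iax} e^{-ixy}\,dx = \int f(x)\, e^{-ix(y-a)}\,dx = \hat f(y-a). \]

(iii) By substitution t = x − a:

\[ \hat g(y) = \int f(x-a)\, e^{-ixy}\,dx = \int f(t)\, e^{-i(t+a)y}\,dt = e^{-iay} \hat f(y). \]

(iv) For h := f ∗ g, we have

\[
\begin{aligned}
\hat h(y) &= \int f*g(x)\, e^{-ixy}\,dx = \int \left( \int f(x-t)\, g(t)\,dt \right) e^{-ixy}\,dx \\
&= \int \left( \int f(x-t)\, e^{-ixy}\,dx \right) g(t)\,dt = \int \left( \int f(x)\, e^{-ixy}\,dx \right) e^{-ity} g(t)\,dt = \hat f(y)\hat g(y).
\end{aligned}
\]

(v) By substitution t = x/λ:

\[ \hat g(y) = \int f(x/\lambda)\, e^{-ixy}\,dx = \int \lambda f(t)\, e^{-it\lambda y}\,dt = \lambda \hat f(\lambda y). \]

(vi) Using partial integration:

\[ \widehat{f'}(y) = \int f'(x)\, e^{-ixy}\,dx = f(x)\, e^{-ixy} \Big|_{-\infty}^{\infty} + iy \int f(x)\, e^{-ixy}\,dx = iy \hat f(y), \]

where the boundary term vanishes, since f and f′ are integrable and hence f(x) → 0 as |x| → ∞.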

5.4.4 The inversion formula and Plancherel's theorem

It will turn out that the Fourier transform is, after suitable normalization, an isometry of L2(R). In order to prove that, we will first consider the following function. For λ > 0, let

\[ h_\lambda(y) := \int e^{-\lambda|x|} e^{ixy}\,dx. \]

The properties of this function are listed in the following proposition.

Proposition 5.4.4. For $h_\lambda$, we have

\[ h_\lambda(y) = \frac{2\lambda}{\lambda^2 + y^2}, \quad\text{and}\quad \int h_\lambda(x)\,dx = 2\pi. \]

Furthermore, for $f \in L^1_{bc}(\mathbb{R})$,

\[ f * h_\lambda(y) = \int e^{-\lambda|x|} \hat f(x)\, e^{ixy}\,dx, \quad\text{and}\quad \lim_{\lambda\to 0} f * h_\lambda(y) = 2\pi f(y). \]

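Proof. The first assertions follow by straightforward calculations:

\[
\begin{aligned}
h_\lambda(y) &= \int e^{-\lambda|x|} e^{ixy}\,dx = \int_{-\infty}^{0} e^{ixy + \lambda x}\,dx + \int_{0}^{\infty} e^{ixy - \lambda x}\,dx \\
&= \frac{1}{iy + \lambda} - \frac{1}{iy - \lambda} = \frac{2\lambda}{\lambda^2 + y^2}, \\
\int h_\lambda(x)\,dx &= \frac{2}{\lambda} \int \frac{1}{(x/\lambda)^2 + 1}\,dx = 2 \int \frac{1}{t^2 + 1}\,dt = 2\pi.
\end{aligned}
\]

The second assertion immediately follows from the definition of the convolution. For the last assertion, note that

\[
\begin{aligned}
f * h_\lambda(x) - 2\pi f(x) &= \int f(x-y)\, h_\lambda(y)\,dy - \int f(x)\, h_\lambda(y)\,dy \\
&= \int \big(f(x-y) - f(x)\big)\, h_\lambda(y)\,dy = \int \big(f(x-y) - f(x)\big)\, \frac{h_1(y/\lambda)}{\lambda}\,dy \\
&= \int \big(f(x - \lambda y) - f(x)\big)\, h_1(y)\,dy.
\end{aligned}
\]

Since f is bounded, $h_1$ is integrable, and $\lim_{\lambda\to 0} (f(x - \lambda y) - f(x)) = 0$, the assertion follows by dominated convergence.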

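As a corollary of this proposition, we obtain the following theorem.

Theorem 5.4.5 (Inversion formula). For $f \in L^1_{bc}(\mathbb{R})$ such that $\hat f \in L^1_{bc}(\mathbb{R})$, we have

\[ f(x) = \frac{1}{2\pi} \int \hat f(y)\, e^{ixy}\,dy. \]

Proof. By the above proposition (and dominated convergence),

\[ 2\pi f(x) = \lim_{\lambda\to 0} f * h_\lambda(x) = \lim_{\lambda\to 0} \int e^{-\lambda|y|} \hat f(y)\, e^{ixy}\,dy = \int \hat f(y)\, e^{ixy}\,dy. \]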
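The formula for $h_\lambda$ and its total mass 2π can be checked numerically. The following sketch is only an illustration: the names `h_numeric` and `h_closed`, the cutoff of the integration range, and the midpoint rule are assumptions made here, not part of the notes.

```python
import math

def h_closed(lam, y):
    """Closed form 2*lam / (lam^2 + y^2) of h_lambda(y)."""
    return 2 * lam / (lam * lam + y * y)

def h_numeric(lam, y, cutoff=60.0, steps=120000):
    """Midpoint-rule approximation of the integral of exp(-lam*|x|) * exp(i*x*y)
    over [-cutoff, cutoff]. The imaginary part cancels by symmetry, so only
    the cosine part is summed."""
    h = 2 * cutoff / steps
    total = 0.0
    for j in range(steps):
        x = -cutoff + (j + 0.5) * h
        total += math.exp(-lam * abs(x)) * math.cos(x * y) * h
    return total

def mass(lam, cutoff=2000.0, steps=400000):
    """Approximate integral of h_lambda over [-cutoff, cutoff]; the exact value
    over this range is 4 * atan(cutoff / lam), which tends to 2*pi."""
    h = 2 * cutoff / steps
    return sum(h_closed(lam, -cutoff + (j + 0.5) * h) for j in range(steps)) * h

print(abs(h_numeric(1.0, 2.0) - h_closed(1.0, 2.0)) < 1e-4)
print(abs(mass(1.0) - 4 * math.atan(2000.0)) < 1e-3)
```

The slowly decaying tails of $h_\lambda$ are the reason why the mass is compared with 4·atan(cutoff/λ) on the truncated range rather than with 2π directly.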

5.4.5 An application of the Fourier transform to partial differential equations

Using the Fourier transform, we are now able to solve the following partial differential equation. Let $f_0 : \mathbb{R} \to \mathbb{R}$ and $f : \mathbb{R} \times \mathbb{R}_{\ge 0} \to \mathbb{R}$ be functions. Here, consider $f_0$ as the heat distribution on R at time 0; we want to determine the change in time of this distribution, which is described by f. That is, f should be the solution of

\[ k\, \frac{\partial^2 f(x,t)}{\partial x^2} = \frac{\partial f(x,t)}{\partial t}, \qquad f(x,0) = f_0(x), \]

where k > 0 is some positive constant. For ease of notation, set $f_{xx} := \frac{\partial^2 f(x,t)}{\partial x^2}$, and $f_t := \frac{\partial f(x,t)}{\partial t}$. The partial differential equation can now be written in a very compact form:

\[ k f_{xx} = f_t, \qquad f(\cdot,0) = f_0. \]

In order to solve the equation, the same trick as in Section 5.3.4 applies. Namely, in order to find a solution f, the first step is to find a solution for the Fourier transform $\hat f$ (to be precise: the Fourier transform with respect to the first coordinate), and then to use the inverse of the transform to find f.

\[ \hat f(y,t) = \int f(x,t)\, e^{-ixy}\,dx. \tag{5.1} \]

So assume that we may interchange integration and differentiation. It follows that

\[ (\hat f)_t(y,t) = \int f_t(x,t)\, e^{-ixy}\,dx = \int k f_{xx}(x,t)\, e^{-ixy}\,dx = k\, \widehat{f_{xx}}(y,t). \tag{5.2} \]

By Proposition 5.4.3, part (vi), applied twice, it follows that

\[ \widehat{f_{xx}}(y,t) = -y^2 \hat f(y,t). \tag{5.3} \]

So, the partial differential equation is transformed into the ordinary differential equation $(\hat f)_t(y,t) = -k y^2 \hat f(y,t)$. The general solution of this equation is given by

\[ \hat f(y,t) = g(y)\, e^{-k y^2 t}, \]

where g is some function only depending on y. Since $\hat f(y,0) = \hat f_0(y)$, it follows that $g = \hat f_0$. Hence,

\[ \hat f(y,t) = \hat f_0(y)\, e^{-k y^2 t}. \]

Using the inverse transform for $e^{-k y^2 t}$, one obtains that $\hat\psi(y,t) = e^{-k y^2 t}$ if and only if

\[ \psi(x,t) = \frac{1}{2\sqrt{\pi k t}}\, e^{-\frac{x^2}{4kt}}. \]

Using Proposition 5.4.3, part (iv), it follows that $\hat f = \hat f_0 \hat\psi = \widehat{f_0 * \psi}$. So a solution of the partial differential equation is given by

\[ f = f_0 * \psi. \]

We now have to fix the function spaces in which this approach is well defined: for Equation 5.1, one requires that f(·,t) ∈ L2(R) for all t. Furthermore, Equation 5.2 holds for $f_t(\cdot,t) \in L^1_{bc}(\mathbb{R})$ and $f_{xx}(\cdot,t) \in L^2(\mathbb{R})$. Finally, Equation 5.3 holds for $f_x(\cdot,t), f_{xx}(\cdot,t) \in L^1_{bc}(\mathbb{R})$.
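The solution formula, a convolution of the initial distribution with the Gaussian kernel ψ, can be tested on an example where the solution is known in closed form. The sketch below is illustrative only: the names `heat_kernel`, `solve_heat` and `exact`, the Gaussian initial distribution, and the truncation of the integral are assumptions made here. For the initial distribution e^{−x²/2}, the convolution of two Gaussians gives (1 + 2kt)^{−1/2} e^{−x²/(2(1+2kt))}.

```python
import math

def heat_kernel(x, t, k):
    """psi(x, t) = exp(-x^2 / (4 k t)) / (2 sqrt(pi k t)) for t > 0."""
    return math.exp(-x * x / (4 * k * t)) / (2 * math.sqrt(math.pi * k * t))

def solve_heat(f0, x, t, k, cutoff=20.0, steps=4000):
    """Approximate (f0 * psi(., t))(x) by the midpoint rule on [-cutoff, cutoff]."""
    h = 2 * cutoff / steps
    total = 0.0
    for j in range(steps):
        s = -cutoff + (j + 0.5) * h
        total += f0(s) * heat_kernel(x - s, t, k) * h
    return total

def f0(x):
    """Gaussian initial heat distribution (an assumption for this test)."""
    return math.exp(-x * x / 2)

def exact(x, t, k):
    """Closed-form solution for the Gaussian initial distribution f0."""
    v = 1 + 2 * k * t
    return math.exp(-x * x / (2 * v)) / math.sqrt(v)

for x in (0.0, 0.7, 2.0):
    print(x, abs(solve_heat(f0, x, 0.5, 1.0) - exact(x, 0.5, 1.0)) < 1e-8)
```

The profile widens and flattens with growing t, which matches the damping factor $e^{-ky^2t}$ on the transform side.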

5.5 The δ-function

The δ-function is not a function in the strict sense, and is used to model infinitesimally short bursts of size 1. In order to introduce this object, consider the following space of 'test' functions:

\[ C_c^\infty(\mathbb{R}) = \{ f : \mathbb{R} \to \mathbb{R} : f \text{ is infinitely often differentiable, and there exist } a, b \in \mathbb{R} \text{ s.t. } f(x) = 0\ \forall x \notin [a,b] \}. \]

Note that a, b ∈ R in the definition of $C_c^\infty(\mathbb{R})$ depend on f. Furthermore, for each $f \in C_c^\infty(\mathbb{R})$ with f = 0 on $[a,b]^c$, it follows that f′(x) = 0 for $x \in [a,b]^c$. Hence, also $f' \in C_c^\infty(\mathbb{R})$. An example for $f \in C_c^\infty(\mathbb{R})$ is given by (see Figure 5.6 for the graph)

\[ f(x) := \begin{cases} 0 & x \le 0 \\ e^{-\frac{1}{x}}\, e^{-\frac{1}{1-x}} & x \in (0,1) \\ 0 & x \ge 1. \end{cases} \]

Furthermore, there is an associated type of convergence in this space of functions. Namely, $f_n \to f$ in $C_c^\infty(\mathbb{R})$ if all derivatives converge uniformly to the derivatives of f, that is

\[ \sup_{x\in\mathbb{R}} \big| f_n^{(k)}(x) - f^{(k)}(x) \big| \to 0 \quad (n \to \infty) \]

for all k ∈ N ∪ {0}. This space is now used to define the space of generalized functions. Since in the sequel the elements of $C_c^\infty(\mathbb{R})$ play the role of a variable, the Greek letter ϕ will be used for elements of this space.

Figure 5.6: Graph of $x \mapsto e^{-\frac{1}{x}} e^{-\frac{1}{1-x}}$ on [0,1].

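Definition 5.5.1. Denote by

\[ C_c^\infty(\mathbb{R})' := \{ \Phi : C_c^\infty(\mathbb{R}) \to \mathbb{R} : \Phi \text{ is linear and continuous} \}. \]

Here, $\Phi \in C_c^\infty(\mathbb{R})'$ is called continuous if convergence of $\varphi_n$ to ϕ in $C_c^\infty(\mathbb{R})$ implies that $\lim \Phi(\varphi_n) = \Phi(\varphi)$. The space $C_c^\infty(\mathbb{R})'$ is called the dual of $C_c^\infty(\mathbb{R})$, and its elements are called distributions or generalized functions.

As one might guess from the name generalized function, each reasonable function from R to R should give rise to a generalized function. For f : R → R the associated linear map $C_c^\infty(\mathbb{R}) \to \mathbb{R}$ is given by

\[ \Phi_f : C_c^\infty(\mathbb{R}) \to \mathbb{R}, \qquad \varphi \mapsto \int f(x)\varphi(x)\,dx. \]

So it is left to identify a function space such that $\Phi_f$ is defined and continuous. So assume that f ∈ L1(R), and that $(\varphi_n)$ is a sequence in $C_c^\infty(\mathbb{R})$ such that $\varphi_n$ converges to some ϕ in $C_c^\infty(\mathbb{R})$. Then $\varphi_n$ converges to ϕ uniformly. Hence,

\[
\begin{aligned}
|\Phi_f(\varphi_n) - \Phi_f(\varphi)| &= \left| \int f(x)\big(\varphi_n(x) - \varphi(x)\big)\,dx \right| \le \int |f(x)|\,|\varphi_n(x) - \varphi(x)|\,dx \\
&\le \|f\|_1 \sup_{x\in\mathbb{R}} |\varphi_n(x) - \varphi(x)| \xrightarrow{n\to\infty} 0.
\end{aligned}
\]

So, $\Phi_f$ is continuous for f ∈ L1(R), and that $\Phi_f(\varphi)$ is finite follows by the same argument.⁹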

The δ-function, sometimes also called Dirac-δ-function, is an important example of a generalized function which is not defined via a function f : R → R (it is really a generalized function!).

Definition 5.5.2 (δ-function). Let $x_0 \in \mathbb{R}$. The δ-function $\delta_{x_0}$ is defined as the element of $C_c^\infty(\mathbb{R})'$ given by $\delta_{x_0}(\varphi) := \varphi(x_0)$, for $\varphi \in C_c^\infty(\mathbb{R})$.

By abuse of notation (or respectively by interpretation of $\delta_{x_0}$ as a probability measure which is concentrated on the point $x_0 \in \mathbb{R}$), one often also uses the notation

\[ \delta_{x_0}(\varphi) = \int \varphi(x)\, \delta(x - x_0)\,dx. \]

Note that for a differentiable function f, and $\varphi \in C_c^\infty(\mathbb{R})$, we have by partial integration that

\[ \Phi_{f'}(\varphi) = \int f'(x)\varphi(x)\,dx = f(x)\varphi(x)\Big|_{-\infty}^{\infty} - \int f(x)\varphi'(x)\,dx = -\int f(x)\varphi'(x)\,dx. \tag{5.4} \]

Here, the first summand is equal to zero, since ϕ = 0 on the complement of some interval [a,b]. This motivates the following general definition. For a generalized function Φ, let

\[ \Phi'(\varphi) := -\Phi(\varphi'), \quad\text{for } \varphi \in C_c^\infty(\mathbb{R}). \]

⁹ In fact, it is sufficient to require that f is locally integrable, that is, for all a, b ∈ R, $\int_a^b |f|\,dx < \infty$.

By construction of $C_c^\infty(\mathbb{R})$, the usual derivative ϕ′ is again an element of $C_c^\infty(\mathbb{R})$. Furthermore, by definition of convergence in $C_c^\infty(\mathbb{R})$, the derivative ϕ ↦ ϕ′ seen as a linear endomorphism of $C_c^\infty(\mathbb{R})$ is continuous.¹⁰ With Φ′ referring to the derivative of the generalized function Φ, we obtain that every element of $C_c^\infty(\mathbb{R})'$ is infinitely often differentiable with respect to this new, more general definition.

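Example 53. (i) The Heaviside function H : R → R is defined by

\[ H(x) := \begin{cases} 1 & x \ge 0 \\ 0 & x < 0. \end{cases} \]

For a function $\varphi \in C_c^\infty(\mathbb{R})$ it follows that

\[ \Phi_H'(\varphi) = -\Phi_H(\varphi') = -\int_0^\infty \varphi'(x)\,dx = -\lim_{t\to\infty} \big(\varphi(t) - \varphi(0)\big) = \varphi(0). \]

Hence the derivative of the Heaviside function is $\delta_0$.

(ii) From Equation 5.4 it follows that $\Phi_f'$ for some differentiable function f is given by $\Phi_{f'}$.

(iii) Assume that f is a continuous, piecewise continuously differentiable function: there exist $t_1 < t_2 < \cdots < t_k$ such that f restricted to each of the intervals $(-\infty, t_1], [t_1, t_2], \dots, [t_k, \infty)$ is differentiable. Then

\[
\begin{aligned}
\Phi_f'(\varphi) = -\Phi_f(\varphi') &= -\int_{-\infty}^{t_1} f(x)\varphi'(x)\,dx - \int_{t_1}^{t_2} f(x)\varphi'(x)\,dx - \cdots - \int_{t_k}^{\infty} f(x)\varphi'(x)\,dx \\
&= \int_{-\infty}^{t_1} f'(x)\varphi(x)\,dx + \int_{t_1}^{t_2} f'(x)\varphi(x)\,dx + \cdots + \int_{t_k}^{\infty} f'(x)\varphi(x)\,dx \\
&= \int f^*(x)\varphi(x)\,dx = \Phi_{f^*}(\varphi),
\end{aligned}
\]

where the boundary terms of the partial integrations cancel since f is continuous and ϕ vanishes outside a bounded interval, and where $f^*(x)$ is a right continuous function defined by

\[ f^*(x) := \lim_{\varepsilon\to 0,\ \varepsilon>0} f'(x + \varepsilon). \]

(iv) The derivative of $\delta_{x_0}$ is determined as follows. By definition,

\[ \delta_{x_0}'(\varphi) := -\int \varphi'(x)\, \delta(x - x_0)\,dx = -\varphi'(x_0). \]

Hence, $\delta_{x_0}'$ is, up to the sign, the usual derivative at $x_0$ for $\varphi \in C_c^\infty(\mathbb{R})$, seen as a linear map from $C_c^\infty(\mathbb{R})$ to R.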
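Part (i) of this example can be illustrated numerically with a concrete test function. The sketch below is an assumption-laden illustration: the names `bump`, `bump_prime` and `heaviside_derivative` are hypothetical, and the test function is a shifted version of the standard bump exp(−1/(1−x²)) rather than the function of Figure 5.6.

```python
import math

def bump(x):
    """Smooth test function exp(-1 / (1 - x^2)) on (-1, 1), and 0 elsewhere."""
    if abs(x) >= 1:
        return 0.0
    return math.exp(-1.0 / (1.0 - x * x))

def bump_prime(x):
    """Derivative of bump: bump(x) * (-2x / (1 - x^2)^2) on (-1, 1)."""
    if abs(x) >= 1:
        return 0.0
    return bump(x) * (-2.0 * x / (1.0 - x * x) ** 2)

def heaviside_derivative(phi_prime, upper, steps=20000):
    """Evaluate -integral of phi'(x) over [0, upper] by the midpoint rule,
    which equals Phi_H'(phi) when phi vanishes beyond `upper`."""
    h = upper / steps
    return -sum(phi_prime((j + 0.5) * h) for j in range(steps)) * h

# test function centred at 0.3, so that phi(0) is not zero (arbitrary choice)
phi = lambda x: bump(x - 0.3)
phi_prime = lambda x: bump_prime(x - 0.3)

value = heaviside_derivative(phi_prime, upper=1.3)
print(abs(value - phi(0.0)) < 1e-5)       # Phi_H'(phi) agrees with delta_0(phi) = phi(0)

# derivative of the delta function at x0, applied to phi: -phi'(x0)
x0 = 0.5
print(-phi_prime(x0))
```

The first printed comparison is the numerical counterpart of the statement that the derivative of the Heaviside function is $\delta_0$.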

¹⁰ This is a special feature of $C_c^\infty(\mathbb{R})$. In general, the differential operator is not continuous.

Bibliography

[Ar] Michael Artin. Algebra. Prentice Hall, 1991.

[De] Anton Deitmar. A first course in harmonic analysis. Second edition. Universitext. Springer-Verlag, New York, 2005.

[DKLM] Frederik Michel Dekking, Cornelis Kraaikamp, Hendrik Paul Lopuhaa, Ludolf Erwin Meester. A Modern Introduction to Probability and Statistics: Understanding Why and How. Springer-Verlag London Limited, 2005.

[RHB] K.F. Riley, M.P. Hobson and S.J. Bence. Mathematical methods for physics and engineering. Cambridge University Press, 2006.

[Ta] Kwong-Tin Tang. Mathematical Methods for Engineers and Scientists 3: Fourier Analysis, Partial Differential Equations and Variational Methods. Springer-Verlag Berlin Heidelberg, 2007.

