
Probability theory II

Dr. Peter Gracar
Universität zu Köln

Version of January 30, 2020
Based on lecture notes of Prof. Dr. Alexander Drewitz

If you spot any typos or mistakes, please drop an email to [email protected].


Contents

1 Conditional probabilities and expectations revisited
1.0.1 Basic properties of conditional expectations

2 Martingales
2.0.1 Discrete stochastic integral
2.0.2 Doob decomposition
2.0.3 Quadratic variation
2.1 Stopping times
2.1.1 Inequalities for (sub-)martingales
2.2 A first martingale convergence theorem
2.3 Uniform integrability
2.3.1 Convergence theorems for uniformly integrable martingales
2.3.2 Square integrable martingales
2.3.3 A central limit theorem for martingales
2.4 Backwards martingales
2.5 A discrete Black-Scholes formula
2.6 A discrete Itô formula

3 Markov processes
3.1 Interlude: Regular conditional distributions
3.1.1 Properties of transition kernels
3.2 Back to Markov processes again
3.2.1 Poisson process
3.3 Martingales and the Markov property
3.4 Doob's h-transform

4 Stationarity & ergodicity
4.1 Ergodic theorems
4.1.1 Subadditive ergodic theorem

5 Brownian motion and random walks
5.1 Another construction of Brownian motion
5.1.1 Lévy's construction of Brownian motion
5.2 Some properties of Brownian motion
5.2.1 Law of the iterated logarithm
5.2.2 Non-differentiability of Brownian motion
5.2.3 The strong Markov property of Brownian motion
5.2.4 The zero set of Brownian motion
5.3 Local central limit theorem

A Alternative constructions and proofs
A.1 Conditional expectation: a different approach
A.2 Martingales

Chapter 1

Conditional probabilities and expectations revisited

We have seen in [Dre19, Definition 3.2.1] how to define the conditional probability

P(A | B) := P(A ∩ B) / P(B)

for A, B ∈ F with P(B) > 0. In particular, for any such B, the function

F ∋ A ↦ P(A | B)

defines a probability measure on F. We can therefore also define the conditional expectation of a real random variable X (if it exists) as

E[X | B] := E[X 1_B] / P(B) = (1/P(B)) ∫_B X dP = ∫_Ω X dP(· | B),    (1.0.1)

where the latter equality follows using algebraic induction.

Example 1.0.1. Consider the example of the random variable X modeling a fair dice roll. The fairness assumption means

P(X = i) = 1/6  for all i ∈ {1, 2, . . . , 6}.

Thus, the expected value of the dice roll is E[X] = 7/2.

Now assume that for whatever reason you do get hold of the insider information that the event {X ≥ 4} will occur – what is the expected value of the dice roll conditionally on this event? According to the above definition we have with B := {X ≥ 4} that

E[X | B] = (∫_B X dP) / P(B) = (∑_{i=4}^{6} i · 1/6) / (1/2) = 5,

which coincides with our intuition.
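As a quick sanity check of (1.0.1), the following short Python snippet (a minimal sketch of my own, not part of the original notes) computes the conditional expectation of the die by restricting the uniform distribution to B = {X ≥ 4} and renormalising:

```python
from fractions import Fraction

# Outcomes of a fair die, each with probability 1/6.
outcomes = range(1, 7)
p = Fraction(1, 6)

# B = {X >= 4}: restrict to the event and renormalise, cf. (1.0.1).
B = [x for x in outcomes if x >= 4]
P_B = p * len(B)                        # P(B) = 1/2
E_X_given_B = sum(x * p for x in B) / P_B

print(E_X_given_B)  # 5, matching E[X | B] computed above
```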

Similarly to the interpretation of the common expectation, the conditional expectation can be seen as a weighted average of X restricted to B. It turns out, however, that we would like to have a definition for the conditional expectation and probability not only in the case P(B) > 0, but also for events B ∈ F for which we have P(B) = 0.

Example 1.0.2. Consider the situation of two independent Gaussian random variables X and Y distributed according to N(0, 1), and set Z := X + Y. In this case, if for some reason we know the value of the realisation of X but not that of Y, we would intuitively like to have that


P(Z ∈ · |X = x) (1.0.2)

makes sense, and that it would denote the law N(x, 1) – however, the event {X = x} has probability 0 under P, so our common definition of conditional probability does not apply. In what follows we will introduce the machinery to fix this shortcoming. We will, however, start with introducing the conditional expectation in this generalised setting first and then consider the special case of indicator functions in order to obtain the notion of conditional probability.

Definition 1.0.3. Let G be a sub-σ-algebra of F, and let X ∈ L1 be a real random variable on (Ω, F, P). A random variable Y defined on (Ω, F, P) is called the conditional expectation of X with respect to G, if

(a) Y is G − B(R)-measurable, and

(b) for any A ∈ G, E[X 1_A] = E[Y 1_A].

In this case we also write E[X | G] := Y. Furthermore, for F ∈ F we write

P(F | G) := E[1F | G]

for the conditional probability of F given G.

Remark 1.0.4. It is possible to define the conditional expectation also for a non-negative random variable that is not necessarily integrable. We will usually restrict ourselves to the case where the random variable X is integrable (as in Definition 1.0.3); we will however prove the existence of conditional expectation also for the non-integrable case below in Theorem 1.0.5.

We will see in Exercise 1.0.8 below that the conditional expectation introduced in Definition 1.0.3 can actually be interpreted as a generalisation of the one introduced in (1.0.1). On the other hand, in Corollary 2.2.9 later we will learn how in a multitude of situations one obtains that the general conditional expectation of Definition 1.0.3 can be approximated by the basic conditional expectation introduced in (1.0.1). It turns out that conditional probabilities and expectations exhibit properties similar to those known from common probabilities and expectations. Before making this more precise, however, we establish their existence and almost sure uniqueness.

Theorem 1.0.5 (Existence and uniqueness of conditional expectations). For X ∈ L1(Ω, F, P), or X ∈ L0(Ω, F, P) and non-negative, and any sub-σ-algebra G ⊂ F, the conditional expectation E[X | G] exists and is unique P-a.s.

Proof. We begin by proving the statement for X integrable. The two maps

µ⁺ : G ∋ G ↦ E[X⁺ 1_G],    µ⁻ : G ∋ G ↦ E[X⁻ 1_G]

define two non-negative finite measures on (Ω, G), which are both absolutely continuous with respect to P. Therefore, the Radon-Nikodym theorem [Dre19, Corollary 2.2.17] provides us with G − B(R)-measurable densities Y⁺, Y⁻ ∈ L1(Ω, G, P) such that

µ±(G) = ∫_G Y± dP.

As a consequence, Y := Y⁺ − Y⁻ has the properties of the conditional expectation of X given G.

Uniqueness:


Assume yet another conditional expectation Z of X given G to be given. Then {Y > Z}, {Y < Z} ∈ G, so by definition of the conditional expectation,

0 = E[X 1_{Y>Z}] − E[X 1_{Y>Z}] = E[Y 1_{Y>Z}] − E[Z 1_{Y>Z}] = E[(Y − Z) 1_{Y>Z}].

Using [Dre19, Lemma 2.2.3], this implies that P((Y − Z) 1_{Y>Z} > 0) = 0 and therefore P(Y > Z) = 0; similarly we obtain P(Y < Z) = 0, so P(Y = Z) = 1, which finishes the proof for X integrable.

For X non-negative, but not necessarily integrable, define first Xn = min{n, X}. As Xn is bounded, the conditional expectation Yn := E[Xn | G] exists by the first part of the proof. Furthermore, notice that Y0 = X0 = 0 and that for any n, the event G := {Yn+1 < Yn} ∈ G. It follows that

E[1G(Yn − Yn+1)] = E[1G(Xn −Xn+1)] ≤ 0

and from that Yn+1 ≥ Yn a.s. Setting Y := sup_n Yn, the monotone convergence theorem gives

E[1_G Y] = lim_{n→∞} E[1_G Yn] = lim_{n→∞} E[1_G Xn] = E[1_G X]

as required.

Finally, suppose that Z is also a non-negative G-measurable random variable satisfying the definition of conditional expectation. Then, for any x ∈ R, setting G = {Y > Z, x > Y} makes 1_G Y and 1_G Z bounded, and

E[1_G(Y − Z)] = E[1_G X] − E[1_G X] = 0,

so that P(G) = 0. Letting x go to infinity, we obtain that almost surely Z ≥ Y. Reversing the roles of Y and Z gives that Y = Z a.s.

According to the above result, conditional expectations are only uniquely defined P-almost surely, and any random variable fulfilling the properties of the conditional expectation will be called a version of the conditional expectation. Therefore, all (in-)equalities involving conditional expectations will always be understood in a P-a.s. sense in what follows. Note also that in the case the random variable X is non-negative, but possibly not integrable, we must allow E[X | G] in the statement of the theorem to take values in R₊ ∪ {∞}. As we will see shortly, we can omit the second possibility precisely when X is integrable. It should be noted that one can prove the existence of conditional expectation also without relying on the Radon-Nikodym theorem. Such a construction relies on functional-analytic tools; a short proof can be found in the appendix (see Theorem A.1.1 and Corollary A.1.2).

Definition 1.0.6. For random variables X, Y defined on (Ω, F, P) such that X ∈ L1, whereas Y can map to an arbitrary measurable space (E, E), we define

E[X |Y ] := E[X |σ(Y )].

Remark 1.0.7. Note that there exists a measurable mapping ϕ : (E, E) → (R, B(R)) such that

E[X | Y] = ϕ ∘ Y.

In fact, this map ϕ not only helps to define the conditional expectation E[X | Y] as a random variable from Ω to R, but it will help us to define the so-called regular conditional probability (‘reguläre bedingte Wahrscheinlichkeit’), where in particular we want to be able to interpret expressions such as that in (1.0.2) as measurable functions in x.


Exercise 1.0.8. (a) Show that in the setting of Example 1.0.2 we have

E[Z |X] = X,

so ϕ from the previous remark equals the identity mapping from R to R.

(b) For B ∈ F with P(B) > 0 we can recover the conditional expectation E[X | B] defined in (1.0.1) from our new definition by considering E[X | σ(B)]. Then the latter is P-a.s. constant and equal to E[X | B] on B.

More generally, for a finite partition P of Ω with P ⊂ F, and X ∈ Lp for some p ∈ [1,∞), we have

E[X | σ(P)] = ∑_{A ∈ P : P(A) > 0} E[X | A] 1_A.
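The partition formula is easy to test numerically. The sketch below (my own illustration, assuming a standard normal X and the two-cell partition generated by the sign of a correlated variable W) estimates E[X | σ(P)] by averaging X over each partition cell:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X standard normal; W correlated with X; P = {{W < 0}, {W >= 0}}.
X = rng.standard_normal(n)
W = X + rng.standard_normal(n)

cond_exp = np.empty(n)
for cell in (W < 0, W >= 0):
    # On each cell A, E[X | sigma(P)] equals the constant E[X | A].
    cond_exp[cell] = X[cell].mean()

# Defining property (b) of Definition 1.0.3 with A a cell of the partition:
A = W >= 0
print(np.mean(X * A), np.mean(cond_exp * A))  # approximately equal
```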

1.0.1 Basic properties of conditional expectations

Theorem 1.0.9 (Properties of conditional expectations). Let X,Y ∈ L1 and A ⊂ G be sub-σ-algebras of F . Furthermore, let c ∈ R. Then P-a.s.:

(a) For c ∈ R, E[cX + Y | G] = c E[X | G] + E[Y | G]. (linearity) (1.0.3)

(b) If X ≤ Y, then E[X | G] ≤ E[Y | G]. (monotonicity)

(c) |E[X | G]| ≤ E[|X| | G]. (triangle inequality)

(d) For Xn ∈ L1, n ∈ N, with Xn ↑ X P-a.s. as n → ∞,

E[Xn | G] ↑ E[X | G] P-a.s. (monotone convergence theorem for conditional expectations)

(e) If XY ∈ L1 and X is G − B(R)-measurable, then

E[XY | G] = XE[Y | G]. (1.0.4)

(f) E[E[X | G] | A] = E[E[X | A] | G] = E[X | A]. (tower property)

(g) If σ(X) and G are independent, then

E[X | G] = E[X]. (1.0.5)

(h) E[X | {∅, Ω}] = E[X].

Proof.

(a) The right-hand side of (1.0.3) is G − B-measurable, and furthermore, due to the linearity of the expectation and the characterising properties of the conditional expectation, we have for any G ∈ G that

E[(c E[X | G] + E[Y | G]) 1_G] = c E[E[X | G] 1_G] + E[E[Y | G] 1_G] = c E[X 1_G] + E[Y 1_G] = E[(cX + Y) 1_G],

which shows the desired equality.


(b) Define the set G := {E[X | G] > E[Y | G]} ∈ G and write Z := (E[X | G] − E[Y | G]) 1_G. Then due to the characterising properties of the conditional expectation as well as its linearity,

E[Z] = E[(E[X − Y | G]) 1_G] = E[(X − Y) 1_G] ≤ 0,

where the last inequality follows from the assumption that X ≤ Y. Since Z ≥ 0 by definition, the above implies Z = 0 P-a.s., and hence E[X | G] ≤ E[Y | G].

(c) This follows from the linearity and monotonicity of the conditional expectation in combination with X = X⁺ − X⁻ ≤ X⁺ + X⁻ = |X|.

(d) Using the monotonicity of conditional expectations we infer that the monotone limit

Z := lim_{n→∞} E[Xn | G]

exists P-almost surely, and Z is G − B-measurable with Z ≤ E[X | G]. Furthermore, for any G ∈ G we have, by applying the standard monotone convergence theorem twice, that

∫_Ω Z 1_G dP = lim_{n→∞} ∫_Ω E[Xn | G] 1_G dP = lim_{n→∞} ∫_Ω Xn 1_G dP = ∫_Ω X 1_G dP.

Therefore, Z satisfies the properties of a conditional expectation of X with respect to G, and using the uniqueness result Theorem 1.0.5, we infer Z = E[X | G].

(e) Using the linearity of conditional expectation, w.l.o.g. we can restrict ourselves to the case that X, Y ≥ 0. By [Dre19, Lemma 2.0.10] there exists a monotone sequence (Xn) of simple non-negative functions such that Xn ↑ X. In combination with the monotone convergence theorem for conditional expectations we infer the second convergence in

Xn E[Y | G] ↑ X E[Y | G],    E[Xn Y | G] ↑ E[XY | G],

and thus it will be sufficient to prove the result for the case that X is a simple function. In this case, we can furthermore reduce it to the case X = 1_G, G ∈ G, due to the linearity of conditional expectations. We then obtain for any A ∈ G that

∫_A 1_G E[Y | G] dP = ∫_{A∩G} E[Y | G] dP = ∫_{A∩G} Y dP = ∫_A 1_G Y dP,

where we used A ∩ G ∈ G and the defining properties of the conditional expectation in the second equality. Using the uniqueness result Theorem 1.0.5, we infer the desired equation.

(f) The second equality is a consequence of the assumption A ⊂ G and (1.0.4).

Now for arbitrary A ∈ A we have

∫_Ω E[E[X | G] | A] 1_A dP = ∫_Ω E[X | G] 1_A dP = ∫_Ω X 1_A dP,

where in the first and second equality we used the definition of the conditional expectation (and in the second equality also the fact that A ⊂ G). Using the uniqueness result in Theorem 1.0.5, we infer

E[E[X | G] | A] = E[X | A]

as desired.

(g) Exercise.

(h) Exercise.


Note that the proof of property (d) works also when the random variables are not integrable but instead non-negative. Observe furthermore that by the tower property (f), for X non-negative but not necessarily integrable, we have with A = {∅, Ω}

E[E[X | G]] = E[X],

which implies that X is integrable precisely when E[X | G] is integrable.

We will now turn to deriving further properties that are known for common probabilities and expectations, and extend them to conditional probabilities and expectations.

Theorem 1.0.10 (Dominated convergence theorem). Let G be a sub-σ-algebra of F, where (Ω, F, P) is the underlying probability space. Furthermore, let (Xn) be a sequence in L1 and let Y ∈ L1 be such that |Xn| ≤ |Y| for all n ∈ N. Then, if Xn → X P-a.s. for some random variable X, we have

lim_{n→∞} E[Xn | G] = E[X | G] P-a.s.

Proof. Set

Un := inf_{m≥n} Xm ≤ Xn ≤ sup_{m≥n} Xm =: Vn.

Then, since −|Y | ≤ Un ↑ X ≤ |Y |, the MCT for conditional expectations supplies us with

E[Un | G] ↑ E[X | G].

Similarly, since −|Y | ≤ Vn ↓ X ≤ |Y |, the MCT for conditional expectations supplies us with

E[Vn | G] ↓ E[X | G].

Combining this with E[Un | G] ≤ E[Xn | G] ≤ E[Vn | G] (which follows from the monotonicity property of conditional expectations) supplies us with

E[Xn | G]→ E[X | G],

which finishes the proof.

Theorem 1.0.11 (Conditional Jensen's inequality). Let I ⊂ R be an interval and ϕ : I → R a convex function. Then if X ∈ L1(Ω, F, P) is such that X(Ω) ⊂ I, if ϕ(X) ∈ L1, and if G is a sub-σ-algebra of F,

ϕ(E[X | G]) ≤ E[ϕ(X) | G] P-a.s.

Proof.

If ϕ is an affine function (i.e., ϕ(x) = ax + b for constants a, b ∈ R), then due to X ∈ L1, the linearity of the conditional expectation (Theorem 1.0.9 (a)) supplies us with

ϕ(E[X | G]) = E[ϕ(X) | G].

If ϕ is not affine, we consider the countable space of affine functions

V := {f : R ∋ x ↦ ax + b : a, b ∈ Q, ϕ(x) ≥ f(x) ∀x ∈ R}.

Since ϕ is convex we have the identity

ϕ(x) = sup{f(x) : f ∈ V},    (1.0.6)


and we note that as a countable supremum of measurable functions, the left-hand side is measurable again. Furthermore, Theorem 1.0.9 Parts (b) and (a) imply that for f ∈ V we have

E[ϕ(X) | G] ≥ f(E[X | G]) P-a.s.

Taking the (countable) supremum over f ∈ V in this inequality yields

E[ϕ(X) | G] ≥ ϕ(E[X | G]) P-a.s.

Remark 1.0.12. With Theorem 1.0.11 at hand, the triangle inequality for conditional expectations would also follow immediately by setting I = R and ϕ = | · |.

The following result admits a very nice and intuitive interpretation of the conditional expectation for a large class of random variables. In geometric terms, it states that E[X | G] is the orthogonal projection of X ∈ L2(Ω, F, P) onto the closed subspace L2(Ω, G, P) of L2(Ω, F, P).

Theorem 1.0.13. Let X ∈ L2(Ω, F, P) and let G ⊂ F be a sub-σ-algebra of F. Then the functional

L2(Ω, G, P) ∋ Y ↦ E[(Y − X)²] ∈ [0,∞)

is minimised if and only if Y = E[X | G].

Proof. We start by noting that E[X | G] ∈ L2 due to

E[(E[X | G])²] ≤ E[E[X² | G]] = E[X²] < ∞,

where we used the conditional Jensen inequality in the first inequality, the tower property in the equality, and the assumption X ∈ L2 in the last inequality. Noting that, with Z := E[X | G] − Y,

E[(X − E[X | G]) · Z] = 0

by Theorem 1.0.9 (f) (since Z is G − B-measurable), we rewrite

E[(X − Y)²] = E[(X − E[X | G])² + Z²] = E[(X − E[X | G])²] + E[Z²].    (1.0.7)

This shows that the left-hand side is minimised if and only if Y = E[X | G].
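The projection property can be seen numerically. In the following sketch (my own illustration, not from the notes) we take G = σ(W) for a discrete W, so that E[X | G] is the cell-wise average, and compare the squared distance E[(Y − X)²] for the conditional expectation against a perturbed G-measurable competitor:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

W = rng.integers(0, 4, size=n)          # discrete W; sigma(W) has 4 atoms
X = W + rng.standard_normal(n)          # X in L^2

# E[X | G]: average of X on each atom {W = k}.
cond_exp = np.array([X[W == k].mean() for k in range(4)])[W]

competitor = cond_exp + 0.1             # another G-measurable candidate Y

print(np.mean((cond_exp - X) ** 2))     # minimal squared distance
print(np.mean((competitor - X) ** 2))   # strictly larger, by (1.0.7)
```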

As a generalisation of the characterisation of events A, B ∈ F with P(B) > 0 as independent via the equation

P(A | B) = P(A),

and complementing the property of conditional expectations given in (1.0.5), we have a corresponding property for our generalised version of conditional expectations.

Proposition 1.0.14. Let A, G be sub-σ-algebras of F. Then A and G are independent if and only if

P(A | G) = P(A) ∀A ∈ A.    (1.0.8)

Proof. It follows from (1.0.5) that if A and G are independent, then (1.0.8) holds. Conversely, if (1.0.8) holds true, then for all A ∈ A, G ∈ G,

P(A ∩ G) = E[1_A 1_G] = E[P(A | G) 1_G] = P(A) P(G),

where the last equality uses (1.0.8),

so A and G are independent.



Chapter 2

Martingales

We recall Definition 4.1.1 from [Dre19] of a stochastic process.

Definition 2.0.1. A stochastic process is a family (Xt), t ∈ T, of random variables mapping from some probability space (Ω, F, P) into a measurable space (E, E). Here, T is an arbitrary non-empty set.

An important class of processes is formed by those for which the index set T is totally ordered (e.g., T = R). In this course we will mostly be interested in the case where T = N0, T = N, or T = Z, and (E, E) = (R, B). This suggests that T will in fact often play the role of ‘time’. Therefore, we will assume T = N0 as well as (E, E) = (R, B) in the following unless mentioned otherwise.

Definition 2.0.2. A family (Ft), t ∈ T, is called a filtration (‘Filtration’, ‘Filtrierung’) if it is an increasing family of σ-algebras on Ω with Ft ⊂ F for all t ∈ T.

Example 2.0.3. To a stochastic process (Xt), t ∈ T, we can associate the filtration (Ft), t ∈ T, where

Ft := σ(Xs : s ∈ T, s ≤ t).

It is not hard to check that this actually defines a filtration, and it is usually called the filtration generated by the process (Xt).

If we interpret T as denoting time, then one interpretation is that Ft contains ‘all the information that is available at time t’. In particular, this makes sense if (Ft) is the canonical filtration, since in this case Remark 1.0.7 tells us that any Ft − B-measurable function can be written as ϕ(X1, . . . , Xt), for some ϕ measurable from (Rt, B(Rt)) to (R, B).

Definition 2.0.4. Let (Ω, F, P) be a probability space and (Ft), t ∈ T, be a filtration. The quadruple (Ω, F, P, (Ft)t∈T) is called a filtered probability space. A stochastic process (Xt), t ∈ T, is called adapted to (Ft) if for each t ∈ T, the random variable Xt is Ft − B-measurable. Similarly, in the special case T = N0, a stochastic process (Xt), t ∈ N0, is called predictable (or previsible) (‘vorhersagbar’, ‘previsibel’) with respect to (Ft) if for each n ∈ N, the random variable Xn is Fn−1 − B-measurable. In this case it is also sometimes helpful to define

F∞ := σ(⋃_{n∈N0} Fn).    (2.0.1)

Example 2.0.5. (a) Any process (Xt) is adapted to the filtration (Ft) it generates. Furthermore, the filtration generated by a process is minimal in the sense that for any other filtration (Gt) on (Ω, F, P) to which it is adapted, one has Ft ⊂ Gt for all t ∈ T.


(b) If T = N0, any predictable process is in particular adapted.

Definition 2.0.6. Let (Xt), t ∈ T, be a stochastic process which is adapted to a filtration (Ft), t ∈ T. Then (Xt) is called a martingale (‘Martingal’) (with respect to (Ft))¹ if

(a) Xt ∈ L1 for all t ∈ T;    (2.0.2)

(b) E[Xt | Fs] = Xs for all s, t ∈ T with s ≤ t.    (2.0.3)

If ‘=’ in (2.0.3) is replaced by ‘≥’ (or ‘≤’), then (Xn) is called a submartingale (or supermartingale).

Remark 2.0.7. In the case T = N0 (and for T = N, T = Z), on which we will focus for most of this class, instead of checking (2.0.3) it is sufficient to show that

E[Xn+1 | Fn] = Xn ∀n ∈ T.    (2.0.4)

Indeed, if (2.0.4) holds true, using the tower property we obtain

E[Xn+2 | Fn] = E[E[Xn+2 | Fn+1] | Fn] = E[Xn+1 | Fn] = Xn ∀n ∈ T,

where E[Xn+2 | Fn+1] = Xn+1 by assumption,

and inductively one infers

E[Xn+k | Fn] = Xn ∀k, n ∈ T,

i.e., the validity of (2.0.3).

The most interesting property in the definition of a martingale in Definition 2.0.6 is arguably (2.0.3). Recalling Theorem 1.0.13, which stated that for a random variable with finite second moment the conditional expectation corresponds to the orthogonal projection onto the closed subspace L2(Ω, G, P) of L2(Ω, F, P), it means that for s, t ∈ T with s < t, the best approximation to Xt which is Fs − B-measurable is Xs itself. This is the reason why martingales are also referred to as fair games: using the present information, i.e., all Fs − B-measurable random variables, it is not possible to predict anything better about the future than Xs (such as e.g. that the value at a future time would have a tendency to be higher or lower than at present).

The following can also be interpreted as a game where at each time step one wins or loses one unit with equal probability, independently of everything else. Hence, a lot of the problems related to it contain the term ‘gambling’, such as e.g. the gambler's ruin problem. We currently do not have the time to consider many of the interesting details concerning these problems, but hope to be able to return to them elsewhere.

Example 2.0.8. (a) Consider (Xn), n ∈ N, a sequence of independent random variables such that E[Xn] = 0 for each n ∈ N (in particular, this implies Xn ∈ L1), as well as the canonical filtration characterised by Fn := σ(X1, . . . , Xn). Then with Sn := ∑_{j=1}^n Xj, the sequence (Sn) is a martingale with respect to the filtration (Fn). In particular, we have Sn ∈ L1 since E[Xn] = 0 by assumption (so that each Xn ∈ L1). Furthermore, since the function Rn ∋ (x1, . . . , xn) ↦ ∑_{j=1}^n xj is continuous, we infer in combination with [Dre19, Corollary 2.3.9] that Sn is Fn − B-measurable. Last but not least, observe that by Theorem 1.0.9 (e) and (a),

E[Sn+1 | Fn] = Sn + E[Xn+1 | Fn].

¹For the origins of the word ‘martingale’, see http://www.jehps.net/juin2009/Mansuy.pdf.


But now σ(Xn+1) and Fn are independent due to [Dre19, Proposition 3.2.7], so Theorem 1.0.9 (g) implies that E[Xn+1 | Fn] = E[Xn+1] = 0, which finishes the proof that (Sn) is a martingale. Therefore, in particular, we consider (Sn) to describe a fair game.

If (Sn) is a martingale, then the process defined via (Sn+1 − Sn) is usually referred to as a martingale difference sequence. The above already suggests that a martingale difference sequence can be interpreted as a generalisation of an i.i.d. sequence of random variables.

(b) Consider any random variable X ∈ L1(Ω, F, P), and any filtration (Fn), n ∈ N. Then the sequence (Xn) defined via

Xn := E[X | Fn]

is a martingale with respect to (Fn). We will see below that under certain circumstances we may follow a converse procedure: given a martingale (Xn), we will be able to obtain a random variable X such that E[X | Fn] = Xn.

(c) Let (Sn) be simple random walk and show that the process

cosh(σ)^{−n} exp(σ Sn)

is a martingale with respect to the canonical filtration; see also the numerical sketch following this example.

(d) Describe all stochastic processes on {−1, 0, 1} that are martingales. We note first that if (Xn) is a martingale, then it must satisfy

P(Xn+k = Xn ∀k ≥ 0 | Xn = 1) = 1.

If not, we have on {Xn = 1} that E[Xn+1 | Fn] = P(Xn+1 = Xn | Xn = 1) − P(Xn+1 = −1 | Xn = 1) < 1 = Xn, so that (Xn) can't be a martingale. Similarly we must have that

P(Xn+k = Xn ∀k ≥ 0 | Xn = −1) = 1.

Define now T := min{n ∈ N0 : Xn ∈ {−1, 1}} ∈ N0 ∪ {∞}. Then a.s. Xn = 0 for all n < T and Xn = XT for all n ≥ T. The first case follows directly from the definition of T. Let now P(Xn ≠ XT) > 0 for some n > T. Then we have that either P(Xn+k = Xn ∀k ≥ 0 | Xn = 1) < 1 or P(Xn+k = Xn ∀k ≥ 0 | Xn = −1) < 1, contradicting our earlier observation.

Therefore, we have that the vector (XT, T) completely describes all possible martingales on {−1, 0, 1}. The sole constraint on this vector in (N0 ∪ {∞}) × {−1, 1} is that

P(XT = 1 | T = n) = P(XT = −1 | T = n) = 1/2 ∀n ∈ N.

To see why, note that on {Xn = 0} it must hold that E[Xn+1 | Fn] = 0. At the same time, we must have that

E[Xn+1 | Fn] = P(T = n + 1, Xn+1 = 1) − P(T = n + 1, Xn+1 = −1),

and therefore P(XT = 1, T = n) = P(XT = −1, T = n) = ½ P(T = n).

Finally, we must show that every such pair represents a martingale. Let therefore (T, XT) be a vector satisfying the above restriction. The corresponding stochastic process is then integrable (since it is bounded in absolute value by 1) and adapted (since all quantities needed to determine Xn depend only on Xm for m ≤ n). Finally, E[Xn+1 | Fn] = Xn, since on {T ≤ n} we have that Xn+1 = Xn = XT almost surely by definition, on {T = n + 1} the restriction on (T, XT) gives that E[Xn+1 | Fn] = 0 = Xn, and on {T > n + 1} we have that Xn+1 = Xn = 0 almost surely.
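Parts (a) and (c) lend themselves to a quick numerical check. The following sketch (my own illustration, not part of the notes) simulates simple random walk and verifies the defining property of Definition 1.0.3 – constancy of E[M 1_A] over time for a fixed A ∈ Fn – both for Sn itself and for the exponential martingale of part (c):

```python
import numpy as np

rng = np.random.default_rng(2)
paths, n, sigma = 200_000, 10, 0.5

steps = rng.choice([-1, 1], size=(paths, n))
S = np.cumsum(steps, axis=1)              # columns are S_1, ..., S_10

A = S[:, 4] > 0                           # the event {S_5 > 0} lies in F_5

# Part (a): E[S_10 1_A] = E[S_5 1_A] by the martingale property.
print(np.mean(S[:, 4] * A), np.mean(S[:, 9] * A))

# Part (c): M_m = cosh(sigma)^(-m) exp(sigma S_m) is a martingale as well.
M = np.exp(sigma * S) / np.cosh(sigma) ** np.arange(1, n + 1)
print(np.mean(M[:, 4] * A), np.mean(M[:, 9] * A))  # approximately equal
```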


Remark 2.0.9. (a) If we do not explicitly mention the filtration of a martingale (Xn), we will usually assume that Fn := σ(X1, . . . , Xn).

(b) It might look puzzling at a first glance that we call the process a supermartingale if in fact we have that E[Xn+1 | Fn] is smaller than Xn. If, for example, we choose the i.i.d. variables Xn in Example 2.0.8 in such a way that P(X1 = 1) = p and P(X1 = −1) = 1 − p for some p ∈ (0, 1)\{1/2}, then we call Sn a random walk with drift; we will choose p ∈ (0, 1/2) for the rest of this remark for the sake of definiteness. Everything goes through as before, but we obtain E[Sn+1 | Fn] = Sn + (2p − 1) < Sn. Hence, in this case we see that (Sn) is a supermartingale; if for instance one interprets (Sn) as the evolution of a sequence of biased coin tosses such that for heads (say Xn = 1) one wins one unit in the n-th step, whereas if it shows tails (say Xn = −1) one loses one unit, then on average things are going down, which is far from being super.

Indeed, the reason (Sn) is called a supermartingale in this case is the following. In analysis, a function f ∈ C2(Rd, R) is called harmonic² if

∆f(x) = 0 ∀x ∈ Rd.    (2.0.5)

In this case, it can be shown that for all r > 0 and x ∈ Rd, f satisfies the mean value property

f(x) = (1/|B(x, r)|) ∫_{B(x,r)} f(y) dy = (1/σ(∂B(x, r))) ∫_{∂B(x,r)} f(y) σ(dy),    (2.0.6)

where B(x, r) := {y ∈ Rd : ‖x − y‖₂ ≤ r}. Furthermore, f is called superharmonic (subharmonic, respectively) if the equality in (2.0.5) is replaced by ≤ (≥, respectively). In this case, the equality in (2.0.6) can be replaced by ≥ (≤, respectively).

Exercise 2.0.10. Let ϕ ∈ C2(Rd, R) be subharmonic, and let (Xn) be an i.i.d. sequence of random variables which are uniformly distributed on either ∂B(0, 1) or otherwise B(0, 1). Show that with Sn := ∑_{i=1}^n Xi, the sequence ϕ(Sn) is a submartingale with respect to the canonical filtration Fn := σ(X1, . . . , Xn).

Theorem 2.0.11. (a) Let (Xt) be a martingale with respect to a filtration (Ft) and ϕ : R → R be a convex function. If

E[ϕ(Xt)⁺] < ∞ ∀t ∈ T,    (2.0.7)

then ϕ(Xt) is a submartingale with respect to (Ft).

(b) Let (Xt) be a submartingale with respect to a filtration (Ft) and ϕ : R → R be a convex and non-decreasing function. If (2.0.7) continues to hold, then ϕ(Xt) is a submartingale with respect to (Ft).

Proof. (a) We start with noting that since (Xt) is a process adapted to (Ft), so is the process (ϕ(Xt)), since as a convex function ϕ is B − B-measurable (as follows e.g. from the representation (1.0.6)).

Furthermore, since ϕ is convex, we can find a, b ∈ R such that ϕ(x) ≥ ax + b for all x ∈ R, and so ϕ(x)⁻ ≤ (ax + b)⁻. Thus, we deduce

E[ϕ(Xt)⁻] ≤ E[(aXt + b)⁻] ≤ |b| + |a| E[|Xt|] < ∞.

²This is in fact consistent with the terminology for (discrete) harmonic functions from [Dre18], where for the transition matrix P of a Markov chain we called a function f harmonic if (P − I)f = 0. The operator P − I (or the one given by lim_{t↓0}(P^t − Id)/t in the context of continuous time) will be called the generator (see also Def. 3.3.1 below) in the context of Markov processes, and if time permits we will see that ½∆ is the generator of Brownian motion.


In combination with (2.0.7), this implies

E[|ϕ(Xt)|] <∞, i.e., ϕ(Xt) ∈ L1

for all t ∈ T (cf. (2.0.2)).

Therefore, the conditional Jensen inequality of Theorem 1.0.11 supplies us with

E[ϕ(Xt) | Fs] ≥ ϕ(E[Xt | Fs]) = ϕ(Xs),    (2.0.8)

which finishes the proof.

(b) The proof proceeds in exactly the same way as that of the first part, with the equality in (2.0.8) replaced by ‘≥’, since ϕ is non-decreasing and E[Xt | Fs] ≥ Xs.

Exercise 2.0.12. Find an example which shows that part (b) of Theorem 2.0.11 does not generally hold without the assumption that ϕ is non-decreasing.

Proposition 2.0.13. (a) (Xn) is a submartingale if and only if (−Xn) is a supermartingale.

(b) If (Xn) and (Yn) are submartingales, then so is (Xn ∨ Yn).

Proof. (a) (Xn) has the desired integrability and measurability properties if and only if (−Xn) does. As regards the conditional expectations, we have

E[Xt | Fs] ≥ Xs if and only if E[−Xt | Fs] ≤ −Xs,

due to the linearity of conditional expectations.

(b) As a maximum of integrable and Fn − B-measurable random variables, Xn ∨ Yn is integrable and Fn − B-measurable. Furthermore, using the monotonicity of conditional expectation as well as the definition of a submartingale, we have for s, t ∈ T with s ≤ t that

E[Xt ∨ Yt | Fs] ≥ E[Xt | Fs] ∨ E[Yt | Fs] ≥ Xs ∨ Ys.

2.0.1 Discrete stochastic integral

We will now get back to the gambling interpretation of martingales in more detail. Indeed, since times at which one can trade at a stock exchange are discrete anyway, the notion of a discrete stochastic integral suggests itself as a suitable candidate for modeling this situation. At each time t ∈ N the investor has to decide how much to invest in a certain stock from time t to t + 1. The process describing this investment should be adapted to the canonical filtration of the stock price, since the investor should not be able to take advantage of any information other than the past up to the present of the stock prices – in particular, she should not be able to use any information on future stock prices in that decision.

Definition 2.0.14. Denote by (Xt), (Yt), t ∈ N0, two processes adapted to a filtration (Ft). Then the stochastic integral ((Y • X)n), n ∈ N0, is defined via (Y • X)0 := 0 and

(Y • X)n := ∑_{k=0}^{n−1} Yk · (Xk+1 − Xk), n ∈ N.

If (Xt) is a martingale, then the process ((Y •X)n) is also referred to as a martingale transform.


It is easy to check that the stochastic integral defined in Definition 2.0.14 is adapted to the filtration (Ft).
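The definition translates directly into code. Below is a minimal sketch of my own (the helper name discrete_integral is hypothetical) that computes (Y • X)n along a single path:

```python
import numpy as np

def discrete_integral(Y, X):
    """Return the path n -> (Y . X)_n = sum_{k<n} Y_k (X_{k+1} - X_k),
    with (Y . X)_0 = 0; Y and X are arrays of equal length."""
    increments = Y[:-1] * np.diff(X)
    return np.concatenate(([0.0], np.cumsum(increments)))

# Example: stake Y_k = 1 if the walk is currently non-negative, else 0.
rng = np.random.default_rng(3)
X = np.concatenate(([0], np.cumsum(rng.choice([-1, 1], size=20))))
Y = (X >= 0).astype(float)
print(discrete_integral(Y, X))
```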

Theorem 2.0.15. (a) The process (Xt), t ∈ N0, from Definition 2.0.14 is a martingale if and only if for any adapted process (Yt), t ∈ N0, such that each Yt is a bounded random variable, the stochastic integral ((Y • X)t), t ∈ N0, is a martingale.

(b) The process (Xt), t ∈ N0, from Definition 2.0.14 is a sub-/supermartingale if and only if for any adapted process (Yt), t ∈ N0, with Yt ≥ 0 and Yt bounded for each t ∈ N0, the stochastic integral ((Y • X)t), t ∈ N0, is a sub-/supermartingale, respectively.

Proof. (a) Let (Xt) be a martingale. Then we have

E[(Y • X)n+1 | Fn] = E[(Y • X)n + Yn(Xn+1 − Xn) | Fn] = (Y • X)n + Yn E[Xn+1 − Xn | Fn] = (Y • X)n,

where we used (1.0.4) in the second equality and the fact that E[Xn+1 − Xn | Fn] = 0 since (Xn) is a martingale, so ((Y • X)n) is a martingale.

Conversely, assume that for any (Yn) as in the assumptions, ((Y • X)n), n ∈ N0, is a martingale. Then for any n ∈ N, setting Yn−1 := 1 as well as Ym := 0 for m ≠ n − 1, we have that

0 = (Y • X)n−1 = E[(Y • X)n | Fn−1] = E[Xn − Xn−1 | Fn−1] = E[Xn | Fn−1] − Xn−1,

which shows that (Xn) is a martingale (with respect to (Fn)).

(b) The proof is similar to that of the martingale case and is left as an exercise.

Example 2.0.16. Let (Xn) be an i.i.d. sequence of Rademacher distributed random variables and set Sn := ∑_{i=1}^n Xi, with S0 := 0. Then (Sn) defines a martingale (simple random walk) with respect to the canonical filtration Fn := σ(S1, . . . , Sn) = σ(X1, . . . , Xn), and we can define an adapted process (Hn) via H0 := 1 and recursively

Hn := (H • S)n for n ≥ 1.

In particular, note that the definition of (H • S)n only requires us to know the values of (Hk), k ∈ {0, . . . , n − 1}, so Hn is well-defined. This stochastic integral describes the value of a portfolio for which at each time step from n to n + 1 we are entirely invested, and the portfolio either doubles in value (if Xn+1 = 1) or becomes worthless (if Xn+1 = −1), until we're bankrupt.³ The previous result then tells us that the value of the portfolio is a martingale.

The interpretation of Theorem 2.0.15 is as follows: if you invest in a process constituting a martingale, and if you do so by just using the information that is available at present, then you can't beat the system, in the sense that the resulting portfolio will be a martingale again. Similarly, if you invest (non-negative capital) in a process constituting a sub-/supermartingale, then the resulting portfolio process will be a sub-/supermartingale again. Given this possibly disappointing result, you might still ask if you can leave the investment game at a certain point in time (e.g. at the first time your capital exceeds a given amount) in order to ‘beat the system’. We will investigate this question below in the section on stopping times; before that, however, we introduce a representation of martingales which will be useful in what is about to come.

³I'm not a financial mathematician, but it seems unrealistic to assume that the value of a stock becomes negative; so instead you might think of the value of a so-called futures contract.


2.0.2 Doob decomposition

The following is the first theorem in our series of results attributed to Joseph Doob (see https://en.wikipedia.org/wiki/Joseph_L._Doob).

Theorem 2.0.17 (Doob decomposition). Let (Xt), t ∈ N0, be an adapted process such that Xt ∈ L1(Ω, F, P) for all t ∈ N0.

(a) Then there exists a martingale (Mt), t ∈ N0, and a previsible process (At), t ∈ N0, with A0 = 0 such that

Xn = Mn + An ∀n ∈ N0.    (2.0.9)

Furthermore, (Mn) and (An) are P-a.s. uniquely determined via (2.0.9), i.e., if (M̃t), t ∈ N0, is a martingale and (Ãt), t ∈ N0, a previsible process with Ã0 = 0 such that

Xn = M̃n + Ãn ∀n ∈ N0,

then P(Ãn = An, M̃n = Mn ∀n ∈ N0) = 1.

(b) Furthermore, (Xn) is a submartingale if and only if the process (An) is increasing, i.e., P(An ≤ An+1) = 1 for all n ∈ N0.

Proof. (a) The canonical way to extract a martingale contribution from (Xn) is to define

Mn := X0 + ∑_{i=1}^n (Xi − E[Xi | Fi−1]), n ∈ N0,    (2.0.10)

which by definition is a martingale with respect to (Fn). To make up for the fact that Mn is generally not yet equal to Xn, we set

An := Xn − Mn, n ∈ N,

as well as A0 := 0, which in combination with (2.0.10) yields at once that (An) is previsible, and also the identity (2.0.9).

In order to show the P-a.s. uniqueness, we observe that for processes (M̃n) and (Ãn) as in the assumptions, the process defined via

Mn − M̃n = Ãn − An

is a previsible martingale and, using A0 = Ã0 = 0, equal to 0 (exercise), which yields the desired uniqueness.

(b) Using the first part we have

E[Xn+1 − Xn | Fn] = E[Mn+1 + An+1 − Mn − An | Fn] = An+1 − An,

which shows that (Xn) is a submartingale if and only if (An) is increasing.
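As a concrete illustration (my own, not from the notes): for simple random walk Sn one has E[S²_i | F_{i−1}] = S²_{i−1} + 1, so the Doob decomposition of the submartingale Xn = S²_n is Mn = S²_n − n and An = n. The sketch below builds (An) pathwise from (2.0.10) and checks this:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 15
S = np.concatenate(([0], np.cumsum(rng.choice([-1, 1], size=n))))
X = S ** 2                                   # submartingale X_n = S_n^2

# E[X_i | F_{i-1}] = S_{i-1}^2 + 1 for simple random walk, so by (2.0.10):
cond = X[:-1] + 1                            # E[X_i | F_{i-1}], i = 1..n
M = X[0] + np.concatenate(([0], np.cumsum(X[1:] - cond)))
A = X - M

print(np.array_equal(A, np.arange(n + 1)))            # True: A_n = n, increasing
print(np.array_equal(M, S ** 2 - np.arange(n + 1)))   # True: M_n = S_n^2 - n
```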

2.0.3 Quadratic variation

Definition 2.0.18. Let (Mn) be a square integrable martingale, i.e., a martingale with Mn ∈ L2 for all n ∈ T. The previsible process (An) for which we have that A0 = 0 and that

(M²n − An)

is a martingale, is called the quadratic variation process (‘quadratischer Variationsprozess’) or angle bracket process of (Mn), and it is also denoted by (⟨M⟩n).


We observe that due to Theorem 2.0.17, the process (An) exists and is P-a.s. unique. Furthermore, due to Theorem 2.0.11, in the context of Definition 2.0.18, the process (M²n) is a submartingale and hence, using Theorem 2.0.17 again, the process (⟨M⟩n) is non-decreasing. Also, note that with (Mn) a square integrable martingale, for m, n ∈ N with m < n, Cauchy-Schwarz supplies us with (Mn+1 − Mn)(Mm+1 − Mm) ∈ L1, so in combination with E[Mn+1 − Mn] = E[Mm+1 − Mm] = 0, Theorem 1.0.9 (e) immediately implies

E[(Mn+1 − Mn)(Mm+1 − Mm) | Fn] = (Mm+1 − Mm) E[Mn+1 − Mn | Fn] = 0,

and in particular, taking expectations in the above, we deduce that Mn+1 − Mn and Mm+1 − Mm are uncorrelated. Furthermore, from the fact that (M²n − ⟨M⟩n) is a martingale we infer that

⟨M⟩n = ∑_{i=1}^n (E[M²i | Fi−1] − M²i−1) = ∑_{i=1}^n E[(Mi − Mi−1)² | Fi−1].    (2.0.11)

This justifies that (⟨M⟩n) can be interpreted as a pathwise measurement of the conditional variances of the trajectories of (Mn), and in this interpretation ⟨M⟩∞ := lim_{n→∞} ⟨M⟩n is the total conditional variance along a path. Furthermore, from (2.0.11) we obtain by taking expectations

E[⟨M⟩n] = ∑_{i=1}^n E[(Mi − Mi−1)²] = Var(Mn − M0).    (2.0.12)
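Identity (2.0.12) can be checked numerically. In the sketch below (my own illustration) we take the doubling portfolio Vn = ∏_{i≤n}(1 + Xi), a square integrable martingale for which E[(Vi − Vi−1)² | Fi−1] = V²_{i−1}, so that (2.0.11) gives ⟨V⟩n = ∑_{i=0}^{n−1} V²_i:

```python
import numpy as np

rng = np.random.default_rng(5)
paths, n = 400_000, 10

X = rng.choice([-1.0, 1.0], size=(paths, n))
V = np.cumprod(1.0 + X, axis=1)                 # V_1, ..., V_n; V_0 = 1
V_full = np.concatenate((np.ones((paths, 1)), V), axis=1)

# <V>_n = sum_{i=0}^{n-1} V_i^2, computed pathwise via (2.0.11).
qv = np.sum(V_full[:, :-1] ** 2, axis=1)

# (2.0.12): E[<V>_n] = Var(V_n - V_0); both equal 2^n - 1 = 1023 here.
print(qv.mean(), V_full[:, -1].var())           # approximately equal
```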

Example 2.0.19. Let X ∈ L2, and for a sub-σ-algebra G of F define the conditional variance of X given G as

Var(X | G) := E[X² − E[X | G]² | G].

For a filtration (Fn) we set Mn := E[X | Fn]. Then the claim is that

⟨M⟩n = Var(X) − Var(X | Fn−1) =: An.    (2.0.13)

W.l.o.g. assume that E[X] = 0. We start with observing that (An) defines a previsible process with A0 = 0 a.s. Furthermore, using the (conditional) orthogonality of the increments in the first equality,

E[M²n+1 − An+1 | Fn]
= E[(Mn+1 − Mn)² + M²n − (Var(X) − Var(X | Fn)) | Fn]
= E[(E[X | Fn+1] − E[X | Fn])² − (Var(X | Fn−1) − Var(X | Fn)) | Fn] + M²n − (Var(X) − Var(X | Fn−1))
= M²n − An.

2.1 Stopping times

It will turn out useful to consider stochastic processes not only at fixed or deterministic times, but also at random times. As a motivation one might consider the investing interpretation of the discrete stochastic integral again. As an example, you can consider the strategy that sells the stock once it exceeds a certain value for the first time. We will see later on (see the optional stopping theorem, Theorem 2.1.14 below) that, under suitable assumptions, if the underlying process is a martingale, one cannot ‘beat’ the market using such strategies.

In order to have a compatible notion of random times, we introduce the following definition.


Definition 2.1.1. A random variable τ : (Ω, F, P) → T ∪ {∞} is called a stopping time (with respect to a filtration (Ft), t ∈ T) if for each t ∈ T,

{τ ≤ t} ∈ Ft.    (2.1.1)

Intuitively, this means that in order to decide whether or not the process is ‘stopped’ by time t, you are only allowed to take advantage of all the information that is available at time t; in particular, you're not allowed to peek into the future!

The condition given in (2.1.1) is to be interpreted as follows: if we consider T as time, then (2.1.1) says that in order to decide whether or not {τ ≤ t} occurs, we only need to know the ‘information’ provided by Ft. In the important case of Ft = σ(Xs : s ≤ t) this means that whether or not {τ ≤ t} occurs can be judged just on the basis of the Xs, s ≤ t.

Example 2.1.2. (a) For any process (Xt) adapted to some filtration (Ft) and any t ∈ T, the constant random variable τ := t is a stopping time.

(b) Consider random walk (Sn) as in Example 2.0.8, but with the assumption E[Xn] = 0 replaced by E[Xn] ∈ R\{0} and Var(Xn) > 0. For A ∈ B we can define the so-called first entrance time to the set A via

τ_A := inf{n ∈ T : Sn ∈ A}.

Then τ_A is a stopping time. Indeed, we have for any n ∈ N0 that

{τ_A ≤ n} = ⋃_{k=0}^n {Sk ∈ A},

and each {Sk ∈ A} ∈ Fk ⊂ Fn, which proves that τ_A is a stopping time.

Now consider the case A = (1,∞). Show that

(i) if E[Xn] > 0, then P(τA <∞) = 1;

(ii) if E[Xn] < 0, then P(τA =∞) > 0.

A popular interpretation is that of e^{St} modeling the price of a stock over time. In particular, note that under suitable integrability assumptions (such as assuming P(Xn = 1) = P(Xn = −1) = 1/2 for the underlying random variables for the sake of simplicity), due to Theorem 2.0.11 the stock price (e^{St}), t ∈ T, is a submartingale. This corresponds to the fact that in the long run stocks are expected to outperform investments in the money market since they are more risky, and investors have to have a benefit for taking risks.
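A small simulation (my own sketch) illustrates claims (i) and (ii) above for A = (1, ∞): with positive drift the walk enters A on essentially every simulated path, while with negative drift a noticeable fraction of paths never enters A within the simulated horizon:

```python
import numpy as np

def entered_fraction(p, paths=5_000, horizon=500, seed=6):
    """Fraction of drifted walks S_n (steps +1 w.p. p, -1 w.p. 1-p)
    that enter A = (1, inf), i.e. hit 2, within the horizon."""
    rng = np.random.default_rng(seed)
    steps = np.where(rng.random((paths, horizon)) < p, 1, -1)
    S = np.cumsum(steps, axis=1)
    return (S >= 2).any(axis=1).mean()

print(entered_fraction(0.6))  # positive drift: essentially 1, cf. (i)
print(entered_fraction(0.4))  # negative drift: clearly below 1, cf. (ii)
```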

Lemma 2.1.3. A random variable τ : (Ω, F, P) → T ∪ {∞} is a stopping time if and only if for all n ∈ T (recall that T is assumed to be discrete) we have

{τ = n} ∈ Fn ∀n ∈ T.

Proof. Exercise

Exercise 2.1.4. Show by giving a counterexample that the equivalence given in Lemma 2.1.3 does not apply if T is chosen more generally than before (such as T = [0,∞)).

Consider the following setting: Ω = [0, 1], F = B([0, 1]), and let T ⊂ [0, 1] be such that T ∉ F. Furthermore, set Xt = id : Ω → Ω for all t ∈ T and consider the filtration given by Ft = F for all t ∈ T. Define the random variable τ := inf{t ∈ T : Xt = t}. Then:

• The process (Xt), t ∈ T , is adapted;


• The random variable τ is not a stopping time: indeed, we have

{τ ≤ 1} = T ∉ F1.

• {τ = t} = {t} ∈ Ft for all t ∈ T.

Lemma 2.1.5. Let τ and σ be two stopping times.

(a) The minimum σ ∧ τ and the maximum σ ∨ τ are stopping times.

(b) If e.g. T = Z, or T = N0, then σ + τ is a stopping time.

Proof. (a) We have

{σ ∧ τ ≤ n} = {σ ≤ n} ∪ {τ ≤ n},

and both sets on the right-hand side are in Fn, hence so is the union; analogously, {σ ∨ τ ≤ n} = {σ ≤ n} ∩ {τ ≤ n} ∈ Fn.

(b) We have

{σ + τ ≤ n} = ⋃_{j∈T, j≤n} ⋃_{k∈T, k≤j} ({σ = j − k} ∩ {τ = k}),

and all the sets on the right-hand side are contained in Fn due to Lemma 2.1.3, which proves the claim.

Exercise 2.1.6. Give an example to show that in the setting of the above lemma, τ − σ is generally not a stopping time.

Definition 2.1.7. Let τ be a stopping time with respect to a filtration (Ft)t∈T. We can define the σ-algebra

Fτ := {F ∈ F : F ∩ {τ ≤ n} ∈ Fn ∀n ∈ T},

which is commonly referred to as the σ-algebra of the τ-past (‘σ-Algebra der τ-Vergangenheit’).

Exercise 2.1.8. Show that Fτ as defined above actually defines a σ-algebra.

Remark 2.1.9. Note that, using Lemma 2.1.3, we immediately infer that for F ∈ Fτ we also have that F ∩ {τ = n} ∈ Fn for all n ∈ T.

One of the principal reasons for introducing it is the formulation of the optional sampling theorem below. We will take advantage of the following simple properties of Fτ.

Lemma 2.1.10. Let σ and τ be stopping times.

(a) If σ ≤ τ, then Fσ ⊂ Fτ.

(b) Fσ∧τ = Fσ ∩ Fτ.

Proof. (a) Let F ∈ Fσ. Then, since σ ≤ τ, we have

F ∩ {τ ≤ t} = (F ∩ {σ ≤ t}) ∩ {τ ≤ t} ∈ Ft,

since both F ∩ {σ ≤ t} and {τ ≤ t} are in Ft, which proves the desired inclusion.


(b) Exercise

One of the main reasons for introducing the σ-algebra of the τ-past, which looks slightly unnatural and technical at first glance, is given by the following lemma.

Lemma 2.1.11. Let (Xt) be an adapted process and let τ < ∞ be a stopping time. Then ω ↦ X_{τ(ω)}(ω) defines an Fτ − B-measurable random variable.

Proof. Indeed, for B ∈ B we can write

{Xτ ∈ B} ∩ {τ ≤ t} = ⋃_{s≤t} ({Xs ∈ B} ∩ {τ = s}).

Since all the sets occurring on the right-hand side are contained in Ft, and since the union on the right-hand side is countable, we infer that the left-hand side is contained in Ft as well.

In particular, this lemma implies that for any adapted process (Xt), any B ∈ B, and any stopping time τ, one has {Xτ ∈ B} ∈ Fτ.

One of the fundamental results in the theory of martingales is that the martingale property is preserved under evaluation at stopping times (under certain assumptions).

Theorem 2.1.12 (Optional sampling theorem). Let (Xt), t ∈ T, with T countable, be a (sub-, super-) martingale with respect to a filtration (Ft), and assume that σ and τ are stopping times with σ ≤ τ ≤ N for some N ∈ N. Then (Xσ, Xτ) is an integrable process adapted to (Fσ, Fτ), and

E[Xτ | Fσ] = Xσ (2.1.2)

(with ‘=’ replaced by ‘≥’ and ‘≤’, respectively, in the case of a sub- and supermartingale).

Proof. We give the proof in the case of (Xt) being a submartingale; the case of a supermartingale follows by considering (−Xt), and the case of a martingale follows from combining the corresponding results for the sub- and supermartingale. Adaptedness follows from Lemma 2.1.11, and the integrability follows from |Xσ| ∨ |Xτ| ≤ max_{s∈T, s≤N} |Xs| and the fact that each Xs is integrable. It remains to show (2.1.2), and for this purpose we start with showing that if (Mt) is a martingale, then

E[MN | Fτ ] = Mτ . (2.1.3)

Indeed, the right-hand side of (2.1.3) is an Fτ − B-measurable random variable due to Lemma 2.1.11, and furthermore, for any F ∈ Fτ we have

E[Mτ 1_F] = ∑_{s∈T, s≤N} E[Ms 1_{τ=s} 1_F] = ∑_{s∈T, s≤N} E[E[MN | Fs] 1_{{τ=s}∩F}] = ∑_{s∈T, s≤N} E[MN 1_{{τ=s}∩F}] = E[MN 1_F],

where in the second equality we used the martingale property, in the third equality we took advantage of the fact that {τ = s} ∩ F ∈ Fs (cf. Remark 2.1.9) and the tower property again, and in the last equality we used τ ≤ N. This establishes (2.1.3). Now for (Xt) a submartingale the Doob decomposition (Theorem 2.0.17) yields

Xt = Mt +At,

with (Mt) a martingale and (At) a previsible non-decreasing process with A0 = 0. Hence, we can deduce

E[Xτ | Fσ] = E[Mτ + Aτ | Fσ] = E[E[MN | Fτ] | Fσ] + E[Aτ | Fσ] ≥ Mσ + Aσ = Xσ,

where we took advantage of (2.1.3) in the second equality and in the inequality, and in the inequality we also used the tower property and the fact that (At) is non-decreasing.


Example 2.1.13. In order to see that the previous result is not valid if one drops the boundedness assumptions on σ and τ, one can e.g. consider the following: let X1, X2, . . . be a sequence of independent random variables such that

P(Xn = 2ⁿ) = P(Xn = −2ⁿ) = 1/2.

Then Sn := ∑_{i=1}^n Xi defines a martingale with respect to the canonical filtration (Fn). Define the random time τ₊ := inf{n ∈ N : Sn > 0}. Then τ₊ is a stopping time with respect to the canonical filtration (Fn), and

P(τ₊ < ∞) = 1 and S_{τ₊} = 2.

In particular, (S1, S_{τ₊}) is not a martingale with respect to (F1, F_{τ₊}). Thus, the implication of the optional sampling theorem is violated, and the reason it cannot be applied is that the stopping time τ₊ is not bounded.

Theorem 2.1.14 (Optional stopping theorem). Let (Xt), t ∈ T, be a (sub-, super-) martingale with respect to a filtration (Ft), and let τ be a stopping time. Then (Xt∧τ), t ∈ T, is a (sub-, super-) martingale with respect to (Ft) as well as with respect to (Ft∧τ).

Proof. As in the proof of Theorem 2.1.12, we will give the proof for the case of (Xt) being a submartingale only.

We first observe that Xt∧τ ∈ L1, and due to Lemma 2.1.11 we deduce that Xt∧τ is Ft∧τ − B-measurable (and hence also Ft − B-measurable).

Thus, it remains to show the corresponding inequalities for the conditional expectations.

E[X(t+1)∧τ | Ft∧τ] ≥ Xt∧τ ∀t ∈ T

follows from Theorem 2.1.12, and hence (Xt∧τ) is a submartingale with respect to (Ft∧τ). For showing that it is a submartingale with respect to (Ft) also, we first observe that on {τ ≤ t} ∈ Ft we have

X(t+1)∧τ = Xt∧τ.

As a consequence,

E[X(t+1)∧τ − Xt∧τ | Ft] = E[(X(t+1)∧τ − Xt∧τ) 1_{τ>t} | Ft] = E[(Xt+1 − Xt) 1_{τ>t} | Ft] = 1_{τ>t} E[Xt+1 − Xt | Ft] ≥ 0,

where the third equality takes advantage of {τ > t} ∈ Ft, and the inequality follows since (Xt) is a submartingale with respect to (Ft).

(Sub-, super-) Martingales and the optional stopping theorem provide very powerful tools, and part of the difficulties in / art of applying them is to discover the actual martingale to which to apply the above machinery. This is still fairly easy in the following example, which is a gambler's ruin problem.

Example 2.1.15. Consider simple random walk (Sn) on the integers endowed with the canonical filtration (Fn), so that it becomes a martingale.


(a) For a < 0 < b with a, b ∈ Z we are interested in the probability that (Sn) hits a before it hits b. For this purpose, define the random time

τ := τ_{a,b} := inf{n ∈ N0 : Sn ∈ {a, b}},

and check that it is a stopping time with respect to (Fn). As a consequence, the optional stopping theorem supplies us with the fact that the process (Sn∧τ), n ∈ N0, is a martingale. Thus, in particular we have

0 = E[S0] = E[Sn∧τ].

Applying the dominated convergence theorem to the right-hand side we get

0 = E[Sτ],

but we also have that

E[Sτ] = a P(Sτ = a) + b P(Sτ = b) = a P(Sτ = a) + b(1 − P(Sτ = a)),

which all in all supplies us with

P(Sτ = a) = b / (b − a);

see also the numerical sketch following this example.

(b) Although we know that (Sn) is a martingale and hence a fair game, we could hope to outfox fate / the casino by applying the following strategy: we start at time 0 with 0 capital (corresponding to S0 = 0) and stop playing altogether once we have reached capital 1. I.e., setting

τ1 := inf{n ∈ N0 : Sn = 1},

we consider the sequence of random variables (Sn∧τ1). Now it is not too hard to show that SRW is recurrent, i.e.,

P(Sn = 0 for infinitely many n ∈ N) = 1,

and from this one can also infer that P(τ1 < ∞) = 1. This means that we will almost surely reach the point at which we have capital 1 in finite time. However, the optional stopping theorem tells us that (Sn∧τ1), n ∈ N, still is a martingale with respect to (Fn) – so in particular we have E[Sn∧τ1] = 0 for each n ∈ N (this is yet another example in the flavour of Example 2.1.13). Think about what this means in terms of the gambling strategy!
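Here is a quick Monte Carlo check of the exit probability P(Sτ = a) = b/(b − a) from part (a) (my own sketch, with a = −3 and b = 2, so the predicted value is 2/5):

```python
import numpy as np

rng = np.random.default_rng(7)
a, b, paths = -3, 2, 100_000
hits_a = 0

for _ in range(paths):
    s = 0
    while a < s < b:                 # run until the walk exits (a, b)
        s += rng.choice((-1, 1))
    hits_a += (s == a)

print(hits_a / paths, b / (b - a))   # both approximately 0.4
```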

In the following example it is already slightly harder to identify the ‘right’ martingale, and for the sake of completeness we give another version of the optional stopping theorem, which will be proved as a homework exercise and employed in the following example.

Theorem 2.1.16. Let (Mn) be a martingale with bounded increments, i.e., such that there exists K ∈ (0,∞) with

|Mn+1 −Mn| ≤ K ∀n ∈ N,

and let τ be a stopping time which is integrable. Then

E[Mτ ] = E[M0].

Proof. Problem sheet 3 (Exercise 3.1)


Example 2.1.17 (Patterns in coin tossing). We want to compute the expected time it takes to first see a certain pattern in a sequence of fair coin tosses. For the sake of conciseness let us consider the pattern ‘heads-heads-tails’, abbreviated HHT. The first time that this pattern occurs in a sequence of fair coin tosses modeled by the sequence (Xn) is a stopping time with respect to the canonical filtration (Fn) defined via Fn := σ(X1, . . . , Xn) (exercise). More formally, as in Example 4.2.7 of [Dre19] we consider an i.i.d. sequence (Xn) such that Xn is Rademacher-distributed; we identify {Xn = 1} with the n-th coin toss resulting in heads, and {Xn = −1} with the n-th coin toss resulting in tails. Hence,

τ_HHT := inf{n ∈ N : Xn = Xn+1 = 1, Xn+2 = −1} + 2.

Before each time n ∈ N we consider a new player entering the scene, borrowing one unit from the bank, and betting it on {Xn = 1}. If she wins, she receives 2 units from the casino; if she loses, she loses the one unit she borrowed. In the latter case she stops playing, whereas in the former she continues and bets all her 2 units on {Xn+1 = 1}. Again, if she wins this bet, her stakes are doubled and she receives 4 units from the casino, whereas if she loses, she stops playing, being broke and still having the loan with the bank. In the former case, she bets 4 units on {Xn+2 = −1}; if she wins, she gets 8 units, whereas if she loses, she's broke (still with a loan). In either case she stops playing, but notice that she wins the last bet if and only if τ_HHT = n + 2.

Now writing A^i_n to denote the capital of the i-th player at time n, we can write, using the notation Sn := ∑_{i=1}^n Xi, that

A^i_n = (Y^i • S)n,

where

Y^i_n =
  1, if n = i,
  2, if n = i + 1 and Xn = 1,
  4, if n = i + 2 and Xn−1 = Xn = 1,
  0, otherwise.

Thus, A^i_n − 1, n ∈ N (recall the loan of 1 unit from the bank, which implies the −1), is a martingale due to Theorem 2.0.15, and similarly we have for the accumulated wealth process Vn := ∑_{i=1}^n A^i_n − n that it is a martingale with V0 = 0. Furthermore, we also have that E[τ_HHT] < ∞, since τ_HHT can be dominated by three times a geometric random variable. Therefore, and since there exists a constant C ∈ (0,∞) such that

|Vn+1 − Vn| ≤ C ∀n ∈ N,

we can apply the version of the optional stopping theorem given above as Theorem 2.1.16 to obtain that E[V_{τ_HHT}] = E[V0] = 0. Furthermore, conditional on {τ_HHT = n}, n ≥ 2, we have that

A^i_{τ_HHT} = 0 if i < n, and A^i_{τ_HHT} = 8 if i = n.

Thus, V_{τ_HHT} = 8 − τ_HHT, which in combination with the above yields

8− E[τHHT ] = 0,

and hence

E[τHHT ] = 8.

Note that e.g. E[τ_HHH] is strictly bigger than E[τ_HHT] (or the expected time of any other triplet of results except for TTT, for that matter) – think about why this is the case!
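A simulation (my own sketch) confirms both the value E[τ_HHT] = 8 and the comparison with HHH:

```python
import numpy as np

def mean_hitting_time(pattern, trials=100_000, seed=8):
    """Average index at which `pattern` first completes in fair coin tosses
    (1 = heads, 0 = tails)."""
    rng = np.random.default_rng(seed)
    total = 0
    for _ in range(trials):
        window, n = [], 0
        while window != pattern:
            window = (window + [int(rng.integers(2))])[-len(pattern):]
            n += 1
        total += n
    return total / trials

print(mean_hitting_time([1, 1, 0]))  # HHT: approximately 8
print(mean_hitting_time([1, 1, 1]))  # HHH: approximately 14
```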


Example 2.1.18 (The “abracadabra” problem). Let (Ui)_{i∈N} denote random letters drawn independently and uniformly from the English alphabet. We are interested in the expected number of draws needed to observe the pattern “abracadabra”, i.e. the expectation of

τ = min{n ∈ N, n ≥ 11 : U_{n−11+i} = x_i ∀i ∈ {1, . . . , 11}},

where x1 = ’a’, x2 = ’b’, x3 = ’r’ and so on. To show this, just like in Example 2.1.17, we must find the right martingale, show that it has bounded increments, and show that the stopping time τ is integrable. Finding the right process and proving that it is a martingale is, using the same strategy as in Example 2.1.17, straightforward, so we only describe the strategy of the i-th player and leave proving that it is a martingale with bounded increments as an exercise. Using the notation of Example 2.1.17, we have

Y^i_n =
  1, if n < i,
  Y^i_{n−1} · 26 · 1_{Un = x_{(n−i)+1}}, if i ≤ n ≤ i + 10,
  Y^i_{i+10}, if n > i + 10.

With this strategy, we get that the capital A^i_n = (Y^i • S)n of player i is a martingale with expectation 1. Assuming that τ is integrable, we get as in Example 2.1.17 that

0 = E[V0] = E[Vτ] = 26¹¹ + 26⁴ + 26 − E[τ],

which gives us the desired answer. Before we proceed to show that τ is integrable, let us comment on why Vτ = 26¹¹ + 26⁴ + 26 − τ. Player τ − 10 has entered the game precisely at the beginning of the sequence of letters that spell “abracadabra” and has successfully won at every step, thus his capital is 26¹¹ − 1. All the players that came before him had to have lost at some step of their play (since “abracadabra” has not appeared before) and thus their capital is −1. The same is true for most players that have entered the game after player τ − 10. For example, player τ − 7 has entered the game when the second “a” of the word “abracadabra” is drawn. Therefore, he initially multiplied his capital by 26, but then lost it all in the following step (betting on “b” when the letter “c” was drawn). This is not the case only for two other players. Player τ − 3 has entered the game precisely when the second-to-last “a” of “abracadabra” has been drawn, and has in his first 4 steps correctly bet on the letters “a”, “b”, “r” and “a”, resulting in his capital being 26⁴ − 1. The last player to enter has similarly made a correct bet on “a”. Note that these two players have not made a wrong bet as of time τ, so their capital hasn't dropped to −1 yet. Combining all these players together, we observe the value −1 appearing τ times, which together with the terms 26¹¹, 26⁴ and 26 gives the full value of Vτ.

P(τ < n+ 11 | τ > n) ≥ P(An | τ > n)

for every n. However, the events An and τ > n are independent so that

P(τ < n+ 11 | τ > n) ≥ P(An | τ > n) = P (An) = p11.

By Lemma A.2.1 with N = 11 and ε = p11 we get that E[τ ] <∞.
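The bookkeeping in the argument above can be packaged into a tiny program: a player contributes $26^k$ to $V_\tau + \tau$ precisely when the length-$k$ suffix of the pattern equals its length-$k$ prefix. The following sketch is not part of the notes (the function name is ad hoc), but it computes the expected waiting time for an arbitrary pattern this way:

```python
def expected_wait(pattern, alphabet_size=26):
    # A player is still solvent at time tau, with capital alphabet_size**k,
    # exactly when the length-k suffix of the pattern equals its prefix.
    n = len(pattern)
    return sum(alphabet_size**k
               for k in range(1, n + 1)
               if pattern[:k] == pattern[n - k:])

print(expected_wait("abracadabra"))           # 26**11 + 26**4 + 26
print(expected_wait("hht", alphabet_size=2))  # 8, matching Example 2.1.17
```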

Example 2.1.19 (St. Petersburg paradox; due to Daniel Bernoulli, see https://en.wikipedia.org/wiki/Daniel_Bernoulli, who also was a professor in St. Petersburg). Recall the setting of Example 2.0.16.


2.1.1 Inequalities for (sub-)martingales

In what comes below, the following definition comes in handy.

Definition 2.1.20. For $(X_t)_{t\in T}$ (with $T \subset \mathbb{N}_0$) a real-valued stochastic process we define its running maximum
$$X^*_t := \max_{s\in T,\, s\le t} X_s. \qquad (2.1.4)$$

In the following we will get to know some inequalities which are usually associated with J. Doob. There is, however, some ambiguity about their exact naming, and the names I quote below are the result of a short and non-representative survey of the nomenclature in some standard sources. The following can be interpreted as an extension of Markov's inequality.

Lemma 2.1.21 (Doob’s (submartingale) inequality). Let $(X_t)_{t\in T}$ be a submartingale. Then for any $\lambda \in (0,\infty)$,
$$\lambda \cdot \mathbb{P}(X^*_t \ge \lambda) \le \mathbb{E}\big[X_t \cdot \mathbf{1}_{\{X^*_t \ge \lambda\}}\big].$$

Proof. We define the stopping time
$$\tau_\lambda := \min\{s \in T : X_s \ge \lambda\}.$$
Since $\tau_\lambda \wedge t$ is bounded, the optional sampling theorem (Theorem 2.1.12) supplies us with
$$\mathbb{E}[X_t] \ge \mathbb{E}[X_{\tau_\lambda \wedge t}] = \mathbb{E}\big[X_t \cdot \mathbf{1}_{\{X^*_t < \lambda\}}\big] + \mathbb{E}\big[X_{\tau_\lambda} \cdot \mathbf{1}_{\{X^*_t \ge \lambda\}}\big] \ge \mathbb{E}\big[X_t \cdot \mathbf{1}_{\{X^*_t < \lambda\}}\big] + \lambda\,\mathbb{P}(X^*_t \ge \lambda),$$
so
$$\lambda\,\mathbb{P}(X^*_t \ge \lambda) \le \mathbb{E}\big[X_t \cdot \mathbf{1}_{\{X^*_t \ge \lambda\}}\big].$$
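As a quick numerical sanity check (not from the notes; the parameters are arbitrary), one can compare both sides of Lemma 2.1.21 for the non-negative submartingale $|S_n|$ built from simple random walk:

```python
import numpy as np

rng = np.random.default_rng(0)
t, reps, lam = 200, 20_000, 10.0
# |S_n| is a non-negative submartingale when S is simple random walk.
walks = rng.choice((-1, 1), size=(reps, t)).cumsum(axis=1)
X_t = np.abs(walks[:, -1])
X_star = np.abs(walks).max(axis=1)
print(lam * (X_star >= lam).mean())    # lambda * P(X*_t >= lambda)
print((X_t * (X_star >= lam)).mean())  # E[X_t ; X*_t >= lambda], the larger one
```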

Remark 2.1.22. It should be noted that although Markov's inequality is sharp, e.g. in the case of the constant random variable $X = \lambda$, the above allows us to replace the probability $\mathbb{P}(X_t \ge \lambda)$ on the left-hand side of Markov's inequality by the probability of the union of the corresponding events, $\mathbb{P}(X^*_t \ge \lambda) = \mathbb{P}\big(\bigcup_{s \le t}\{X_s \ge \lambda\}\big)$.

We will apply Lemma 2.1.21 in the proof of the result below. In the following, similarly to (2.1.4), we define
$$|X|^*_t := \max_{s\le t} |X_s|.$$

Theorem 2.1.23 (Doob’s $L^p$-inequality, also Doob’s maximum inequality). Let $(X_t)_{t\in T}$ be a martingale or a positive submartingale with countable index set $T$. Then for any $p \in (1,\infty)$ and $t \in T$,
$$\mathbb{E}\big[|X_t|^p\big] \le \mathbb{E}\big[(|X|^*_t)^p\big] \le \Big(\frac{p}{p-1}\Big)^p\,\mathbb{E}\big[|X_t|^p\big]. \qquad (2.1.5)$$

The statement of Theorem 2.1.23 fails to hold in the case $p = 1$ in the sense that there exists no finite constant $C \in (0,\infty)$ such that for any $(X_t)$ as in the assumptions of Theorem 2.1.23 one has
$$\mathbb{E}\big[|X|^*_t\big] \le C\,\mathbb{E}\big[|X_t|\big]. \qquad (2.1.6)$$
Indeed, consider simple random walk starting in $1$, i.e., $S_n = 1 + \sum_{j=1}^n X_j$ with the $(X_j)$ i.i.d. Rademacher distributed. Furthermore, let $\tau_0 := \inf\{n \in \mathbb{N} : S_n = 0\}$. Then Theorem 2.1.14 implies that $(S_{n\wedge\tau_0})$ is a martingale and hence $\mathbb{E}[|S_{n\wedge\tau_0}|] = \mathbb{E}[S_0] = 1$. On the other hand, one can show that $\mathbb{E}[|S_{\cdot\wedge\tau_0}|^*_n] \to \infty$ as $n \to \infty$, which implies that (2.1.6) cannot hold. One does, however, have a version of the result with $1 + \mathbb{E}[|X_t| \ln(|X_t| \vee 1)]$ on the right-hand side.


Proof of Theorem 2.1.23. The first inequality is clear. We start by noticing that $(|X_t|)$ is a submartingale (either by assumption or by taking advantage of Theorem 2.0.11), and thus Lemma 2.1.21 yields
$$\lambda\,\mathbb{P}(|X|^*_t \ge \lambda) \le \mathbb{E}\big[|X_t| \cdot \mathbf{1}_{\{|X|^*_t \ge \lambda\}}\big]. \qquad (2.1.7)$$
We will truncate $|X|^*_t$ and consider $|X|^*_t \wedge M$ for $M > 0$ instead. Then, using that for $r > 0$ we have $r^p = \int_0^r p\lambda^{p-1}\,\mathrm{d}\lambda$, we get
$$\mathbb{E}\big[(|X|^*_t \wedge M)^p\big] = \mathbb{E}\Big[\int_0^{|X|^*_t \wedge M} p\lambda^{p-1}\,\mathrm{d}\lambda\Big] = \mathbb{E}\Big[\int_0^M p\lambda^{p-1}\,\mathbf{1}_{\{|X|^*_t \ge \lambda\}}\,\mathrm{d}\lambda\Big],$$
which, using Tonelli's theorem, equals
$$\int_0^M p\lambda^{p-1}\,\mathbb{P}(|X|^*_t \ge \lambda)\,\mathrm{d}\lambda \le \int_0^M p\lambda^{p-2}\,\mathbb{E}\big[|X_t| \cdot \mathbf{1}_{\{|X|^*_t \ge \lambda\}}\big]\,\mathrm{d}\lambda = p\,\mathbb{E}\Big[|X_t| \int_0^{|X|^*_t \wedge M} \lambda^{p-2}\,\mathrm{d}\lambda\Big] = \frac{p}{p-1}\,\mathbb{E}\big[|X_t|\,(|X|^*_t \wedge M)^{p-1}\big], \qquad (2.1.8)$$
where we used (2.1.7) in the inequality. Now since $\frac1p + \frac{1}{p/(p-1)} = 1$, we can apply Hölder's inequality with exponents $p$ and $p/(p-1)$ to obtain
$$\mathbb{E}\big[|X_t|\,(|X|^*_t \wedge M)^{p-1}\big] \le \mathbb{E}[|X_t|^p]^{\frac1p}\,\mathbb{E}\big[(|X|^*_t \wedge M)^p\big]^{\frac{p-1}{p}}.$$
Hence, we can continue (2.1.8) to get
$$\mathbb{E}\big[(|X|^*_t \wedge M)^p\big] \le \frac{p}{p-1}\,\mathbb{E}[|X_t|^p]^{\frac1p}\,\mathbb{E}\big[(|X|^*_t \wedge M)^p\big]^{\frac{p-1}{p}}.$$
Raising both sides of the inequality to the $p$-th power and dividing by $\mathbb{E}[(|X|^*_t \wedge M)^p]^{p-1}$ (which is finite due to the truncation with $M$), we obtain
$$\mathbb{E}\big[(|X|^*_t \wedge M)^p\big] \le \Big(\frac{p}{p-1}\Big)^p\,\mathbb{E}[|X_t|^p].$$
Taking $M \to \infty$ yields (2.1.5) and hence finishes the proof.

2.2 A first martingale convergence theorem

We will now see some basic results on martingale convergence. Our starting point will be a fundamental inequality involving the so-called upcrossings of an interval $[a,b]$ by a submartingale $(X_n)$. For this purpose, we keep in mind the following trading strategy: we ‘buy’ a unit of the submartingale once it falls below $a$, and we sell once it exceeds $b$. More formally: define the stopping times $\sigma_0 := 0$,
$$\tau_k := \inf\{n \ge \sigma_{k-1} : X_n \le a\}, \qquad k \ge 1,$$
$$\sigma_k := \inf\{n \ge \tau_k : X_n \ge b\}, \qquad k \ge 1.$$
We can then define the number of upcrossings of $(X_n)$ over $[a,b]$ up to time $m \in \mathbb{N}$ via
$$U_m[a,b] := \sup\{k : \sigma_k \le m\}.$$
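To make the definition concrete, the following small sketch (not from the notes) counts the completed upcrossings of a finite path by literally running the buy/sell strategy above:

```python
def upcrossings(xs, a, b):
    # Number of completed upcrossings of [a, b] by the path xs.
    count, holding = 0, False
    for x in xs:
        if not holding and x <= a:
            holding = True        # 'bought' at or below a
        elif holding and x >= b:
            count += 1            # 'sold' at or above b: one upcrossing done
            holding = False
    return count

print(upcrossings([2, 0, 3, 1, 4], a=1, b=3))  # 2
```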


Lemma 2.2.1 (Upcrossing inequality). Let $(X_n)$ be a submartingale. Then
$$\mathbb{E}\big[U_n[a,b]\big] \le \frac{\mathbb{E}[(X_n-a)^+] - \mathbb{E}[(X_0-a)^+]}{b-a}.$$

Proof. Recall from Theorem 2.0.15 that the discrete stochastic integral $(H \bullet X)_n$ is a submartingale if $(X_n)$ is a submartingale and $(H_n)$ is an adapted, bounded, non-negative process. Putting the above trading strategy in mathematical terms, we define $H_n := 1$ for $n \in \{\tau_k, \dots, \sigma_k - 1\}$ and all $k \in \mathbb{N}$, and $H_n := 0$ otherwise. Then $(H_n)$ is certainly adapted, bounded, and non-negative (check!). If we define $Y_n := a + (X_n - a)^+$, this yields a submartingale (e.g. due to part (b) of Theorem 2.0.11), and as a consequence of Theorem 2.0.15, $(H \bullet Y)_n$ defines a submartingale. Furthermore, we have for $k \in \mathbb{N}$ on the event $\{\sigma_k < \infty\}$ that
$$(H \bullet Y)_{\sigma_k} = \sum_{j=1}^k \big(\underbrace{Y_{\sigma_j}}_{\ge b} - \underbrace{Y_{\tau_j}}_{\le a}\big) \ge (b-a)k.$$
By our definition of $(Y_n)$, for $n \in \{\sigma_k, \dots, \sigma_{k+1}-1\}$ we have $(H \bullet Y)_n - (H \bullet Y)_{\sigma_k} \ge 0$, so
$$(H \bullet Y)_n \ge (b-a)\,U_n[a,b].$$
As a consequence, taking expectations we arrive at
$$\mathbb{E}[(H \bullet Y)_n] \ge (b-a)\,\mathbb{E}[U_n[a,b]].$$
Setting $K_n := 1 - H_n \ge 0$ defines a bounded and adapted process, and hence we deduce in an analogous way that $(K \bullet Y)_n$ is a submartingale, which implies that
$$\mathbb{E}[(K \bullet Y)_n] \ge \mathbb{E}[(K \bullet Y)_0] = 0. \qquad (2.2.1)$$
We have
$$(H \bullet Y)_n + (K \bullet Y)_n = Y_n - Y_0,$$
so
$$\mathbb{E}[(H \bullet Y)_n] + \mathbb{E}[(K \bullet Y)_n] = \mathbb{E}[Y_n] - \mathbb{E}[Y_0],$$
which in combination with (2.2.1) yields
$$\mathbb{E}[(H \bullet Y)_n] \le \mathbb{E}[Y_n] - \mathbb{E}[Y_0],$$
so
$$\mathbb{E}[U_n[a,b]] \le \frac{\mathbb{E}[(X_n-a)^+] - \mathbb{E}[(X_0-a)^+]}{b-a},$$
which proves the result.

Theorem 2.2.2 (Martingale convergence theorem). Let $(X_n)$ be a submartingale with
$$\sup_{n\in T} \mathbb{E}[X_n^+] < \infty. \qquad (2.2.2)$$
Then there exists a random variable $X_\infty \in L^1(\Omega, \mathcal{F}_\infty, \mathbb{P})$ such that
$$X_n \to X_\infty \quad \mathbb{P}\text{-a.s.}$$
as $n \to \infty$.


Proof of Theorem 2.2.2. For any $a, b \in \mathbb{R}$ with $a < b$, the monotone limit $U[a,b] := \lim_{n\to\infty} U_n[a,b]$ exists as a random variable mapping from $(\Omega, \mathcal{F}, \mathbb{P})$ to $([0,\infty], \mathcal{B}([0,\infty]))$. Using the monotone convergence theorem and Lemma 2.2.1, we thus infer
$$\mathbb{E}[U[a,b]] = \lim_{n\to\infty} \mathbb{E}[U_n[a,b]] \le \limsup_{n\to\infty} \frac{\mathbb{E}[(X_n-a)^+] - \mathbb{E}[(X_0-a)^+]}{b-a} \le \limsup_{n\to\infty} \frac{\mathbb{E}[X_n^+] + |a|}{b-a} < \infty,$$
where the latter inequality follows from (2.2.2). In particular,
$$\mathbb{P}(U[a,b] < \infty) = 1. \qquad (2.2.3)$$
But we can write
$$\{X_n \text{ does not converge in } [-\infty,\infty]\} \subset \Big\{\liminf_{n\to\infty} X_n < \limsup_{n\to\infty} X_n\Big\} \subset \bigcup_{a,b\in\mathbb{Q}} \Big\{\liminf_{n\to\infty} X_n < a < b < \limsup_{n\to\infty} X_n\Big\} \subset \bigcup_{a,b\in\mathbb{Q}} \big\{U[a,b] = \infty\big\}.$$
Thus, in combination with (2.2.3) we infer that
$$\mathbb{P}(X_n \text{ converges in } [-\infty,\infty]) \ge 1 - \sum_{a,b\in\mathbb{Q}} \mathbb{P}(U[a,b] = \infty) = 1,$$
and we denote the $\mathbb{P}$-a.s. limit of $(X_n)$ by $X_\infty$, which, as the limit of $\mathcal{F}_\infty$-measurable random variables, is $\mathcal{F}_\infty$-measurable itself. To show that we do in fact even have $X_\infty \in L^1$, observe that Fatou's lemma supplies us with
$$\mathbb{E}[X_\infty^+] = \mathbb{E}\big[(\lim_{n\to\infty} X_n)^+\big] = \mathbb{E}\big[\liminf_{n\to\infty} X_n^+\big] \le \liminf_{n\to\infty} \mathbb{E}[X_n^+],$$
and the right-hand side is finite due to (2.2.2). On the other hand, since $(X_n)$ is a submartingale, we obtain
$$\mathbb{E}[X_n^-] = \mathbb{E}[X_n^+] - \mathbb{E}[X_n] \le \mathbb{E}[X_n^+] - \mathbb{E}[X_0].$$
Applying Fatou's lemma again supplies us with
$$\mathbb{E}[X_\infty^-] = \mathbb{E}\big[\liminf_{n\to\infty} X_n^-\big] \le \liminf_{n\to\infty} \mathbb{E}[X_n^-] \le \liminf_{n\to\infty} \mathbb{E}[X_n^+] - \mathbb{E}[X_0],$$
and again the right-hand side is finite due to (2.2.2). This shows $X_\infty \in L^1$ and finishes the proof.

Exercise 2.2.3. Prove that in the context of Theorem 2.2.2, the condition in (2.2.2) is equivalent to the sequence $(X_n)$ being $L^1$-bounded, i.e.,
$$\sup_{n\in\mathbb{N}} \mathbb{E}[|X_n|] < \infty.$$

The above will enable us to give a generalisation of the Borel–Cantelli lemma: the independence assumption on the events $(A_n)$ can be dropped in this version of the lemma. Before that, however, we have to give a small auxiliary lemma which is of interest in its own right.

Lemma 2.2.4. Let $(M_n)$ be a martingale and assume there exists a constant $K \in (0,\infty)$ such that the increments of $(M_n)$ are bounded by $K$, i.e.,
$$|M_{n+1} - M_n| \le K \qquad \forall n \in \mathbb{N}.$$
Then for
$$C_M := \{(M_n) \text{ converges in } \mathbb{R} \text{ as } n \to \infty\}$$
and
$$F_M := \Big\{\limsup_{n\to\infty} M_n = -\liminf_{n\to\infty} M_n = \infty\Big\},$$
we have $\mathbb{P}(C_M \cup F_M) = 1$.


Proof. First observe that $\widetilde M_n := M_n - M_0$ defines a martingale with $\widetilde M_0 = 0$. For $L < 0$ we define the stopping time
$$\tau_L := \inf\{n \in \mathbb{N} : \widetilde M_n \le L\}.$$
The optional stopping theorem (Theorem 2.1.14) implies that $(\widetilde M_{n\wedge\tau_L})$ is a martingale, and furthermore we have $\widetilde M_{n\wedge\tau_L} \ge L - K$ for all $n \in \mathbb{N}$. Hence, $(\widetilde M_{n\wedge\tau_L} - (L-K))$ defines a non-negative martingale, which due to Theorem 2.2.2 converges $\mathbb{P}$-a.s. to a random variable in $L^1$. Hence, on $\{\tau_L = \infty\}$, the martingale $(\widetilde M_n - (L-K))$, and hence also the martingale $(\widetilde M_n)$, converges $\mathbb{P}$-a.s. to a finite limit. Taking $L \to -\infty$ we deduce that $(\widetilde M_n)$ converges in $\mathbb{R}$ $\mathbb{P}$-a.s. on
$$\bigcup_{L\in-\mathbb{N}} \{\tau_L = \infty\} = \Big\{\liminf_{n\to\infty} \widetilde M_n > -\infty\Big\}.$$
Repeating the above procedure for the martingale $(-\widetilde M_n)$, we obtain that $(-\widetilde M_n)$, and hence also $(\widetilde M_n)$, converges $\mathbb{P}$-a.s. on
$$\Big\{\limsup_{n\to\infty} \widetilde M_n < \infty\Big\}.$$
But $(\widetilde M_n)$ converges if and only if $(-\widetilde M_n)$ does, so up to $\mathbb{P}$-null sets
$$\Big\{\liminf_{n\to\infty} M_n > -\infty\Big\} = \Big\{\limsup_{n\to\infty} M_n < \infty\Big\} = C_M,$$
and hence, again up to $\mathbb{P}$-null sets,
$$C_M^c = \Big\{\liminf_{n\to\infty} M_n = -\infty\Big\} = \Big\{\limsup_{n\to\infty} M_n = \infty\Big\} = F_M,$$
which proves the lemma.

Corollary 2.2.5 (Generalised Borel–Cantelli lemma). Let $(\mathcal{F}_n)$ be a filtration with $\mathcal{F}_0 = \{\emptyset, \Omega\}$, and let $(A_n)$ be a sequence of events with $A_n \in \mathcal{F}_n$ for each $n \in \mathbb{N}$. Define
$$Y_m := \sum_{n=1}^m \mathbb{P}(A_n \mid \mathcal{F}_{n-1}) \qquad \text{and} \qquad Z_m := \sum_{n=1}^m \mathbf{1}_{A_n},$$
as well as their (monotone) limits
$$Y_\infty := \lim_{m\to\infty} Y_m \qquad \text{and} \qquad Z_\infty := \lim_{m\to\infty} Z_m.$$
Then $\mathbb{P}$-a.s. the following implications hold true:

(a) $Y_\infty < \infty \implies Z_\infty < \infty$;

(b) $Y_\infty = \infty \implies \dfrac{Z_m}{Y_m} \to 1$ as $m \to \infty$.


Proof. We will only prove the weaker version
$$\mathbb{P}\big(\{Z_\infty = \infty\}\,\Delta\,\{Y_\infty = \infty\}\big) = 0. \qquad (2.2.4)$$
A proof of the fully-fledged result given in the theorem requires a finer understanding of convergence of square integrable martingales and can be found as the proof of Theorem 12.15 in [Wil91]. To prove (2.2.4), we observe that for the martingale (check!) $M_m := Z_m - Y_m$ we have $|M_{m+1} - M_m| \le 1$ for all $m \in \mathbb{N}$. Hence, we can apply Lemma 2.2.4, which (in the notation used there) yields that on $F_M$ one has
$$Z_\infty = \infty \text{ and } Y_\infty = \infty,$$
whereas on $C_M$ one has
$$Z_\infty = \infty \text{ if and only if } Y_\infty = \infty.$$
Since Lemma 2.2.4 implies $\mathbb{P}(C_M \cup F_M) = 1$, this proves the result.

Corollary 2.2.6. If $(X_n)$ is a non-negative supermartingale, then $X_\infty := \lim_{n\to\infty} X_n$ exists $\mathbb{P}$-a.s. with $\mathbb{E}[X_\infty] \le \mathbb{E}[X_0]$.

Proof. We know that $(-X_n)$ is a submartingale with $\sup_{n\in T} \mathbb{E}[(-X_n)^+] \le 0 < \infty$, so the previous theorem implies that $-X_\infty := \lim_{n\to\infty} -X_n$ exists in $L^1$ $\mathbb{P}$-a.s. In addition, Fatou's lemma yields
$$\mathbb{E}[X_\infty] = \mathbb{E}\big[\liminf_{n\to\infty} X_n\big] \le \liminf_{n\to\infty} \mathbb{E}[X_n] \le \mathbb{E}[X_0],$$
where the last inequality follows from $(X_n)$ being a supermartingale.

Example 2.2.7. (a) Consider simple random walk starting in $1$, i.e., $S_n := 1 + \sum_{k=1}^n X_k$, where the $X_k$ are i.i.d. Rademacher distributed, and consider the first hitting time of $0$ defined via
$$\tau_0 := \inf\{n \in \mathbb{N} : S_n = 0\}.$$
Keeping in mind that $(S_{n\wedge\tau_0})$ is a non-negative martingale due to Theorem 2.1.14, we can use Corollary 2.2.6 in order to infer that $S_{n\wedge\tau_0}$ converges $\mathbb{P}$-a.s. to $0$ as $n \to \infty$. Obviously, this implies $\mathbb{P}(\tau_0 < \infty) = 1$.

Also, note that $\mathbb{E}[S_{n\wedge\tau_0}] = \mathbb{E}[S_0] = 1$, since $(S_{n\wedge\tau_0})$ is a martingale. In particular, this shows that $(S_{n\wedge\tau_0})$ cannot converge in $L^1$. This is different for the so-called uniformly integrable martingales that we will get to know below, see Section 2.3.

(b) Still considering $(S_n)$ to be simple random walk (we can start it in $0$ now, so $S_n := \sum_{k=1}^n X_k$): since the martingale $(S_n)$ has increments bounded by $1$, we infer using Lemma 2.2.4 (again taking advantage of the fact that $(S_n)$ cannot converge in $\mathbb{R}$, since its increments are of size $1$ at each time) that we must have
$$\mathbb{P}\Big(\limsup_{n\to\infty} S_n = -\liminf_{n\to\infty} S_n = \infty\Big) = 1.$$
This implies the recurrence of simple random walk, namely that $\mathbb{P}(S_n = 0 \text{ for infinitely many } n \in \mathbb{N}) = 1$.
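The dichotomy in part (a) – almost sure convergence to $0$ while $\mathbb{E}[S_{n\wedge\tau_0}] = 1$ for every $n$ – shows up clearly in simulation. The sketch below is not from the notes, and the truncation at $10{,}000$ steps is an arbitrary choice:

```python
import random

def stopped_walk(max_steps, rng):
    # Run simple random walk from 1, stopped at 0 or after max_steps steps.
    s = 1
    for _ in range(max_steps):
        if s == 0:
            break
        s += rng.choice((-1, 1))
    return s

rng = random.Random(1)
samples = [stopped_walk(10_000, rng) for _ in range(20_000)]
print(sum(s == 0 for s in samples) / len(samples))  # most paths have hit 0
print(sum(samples) / len(samples))                  # yet the mean stays near 1
```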

Corollary 2.2.8. Let $p \in [1,\infty)$, $X \in L^p$, and let $(\mathcal{F}_n)$ be a filtration. Then
$$\mathbb{E}[X \mid \mathcal{F}_n] \to \mathbb{E}[X \mid \mathcal{F}_\infty] \quad \mathbb{P}\text{-a.s.}$$


Proof. According to Example 2.0.8 (b), setting $X_n := \mathbb{E}[X \mid \mathcal{F}_n]$ constitutes a martingale $(X_n)$ with
$$\sup_{n\in T} \mathbb{E}[X_n^+] \le \sup_{n\in T} \mathbb{E}[|X_n|] = \sup_{n\in T} \mathbb{E}\big[\big|\mathbb{E}[X \mid \mathcal{F}_n]\big|\big] \le \sup_{n\in T} \mathbb{E}\big[\mathbb{E}[|X| \mid \mathcal{F}_n]\big] = \mathbb{E}[|X|] < \infty,$$
where the penultimate inequality follows from the conditional Jensen inequality, the last equality from the tower property for conditional expectations, and the finiteness from the fact that $X \in L^p \subset L^1$. Hence, the previous theorem implies the $\mathbb{P}$-a.s. convergence of $X_n$ to some $X_\infty$ which is $\mathcal{F}_\infty$-$\mathcal{B}$-measurable. It remains to argue that
$$X_\infty = \mathbb{E}[X \mid \mathcal{F}_\infty], \qquad (2.2.5)$$
and this will be done in Example 2.3.12 as a demonstration of how powerful the machinery of uniform integrability is.

The following is a useful corollary to Corollary 2.2.8. It states that if the $\sigma$-algebra we condition on is countably generated, then we can approximate the general conditional expectation introduced in Definition 1.0.3 by the basic one we had introduced before in (1.0.1). To make this more precise, recall (see Exercise 1.0.8) that for a finite partition $\mathcal{P}$ of $\Omega$ with $\mathcal{P} \subset \mathcal{F}$, and $X \in L^p$ for some $p \in [1,\infty)$, we have
$$\mathbb{E}[X \mid \sigma(\mathcal{P})] = \sum_{\substack{A\in\mathcal{P}\\ \mathbb{P}(A)>0}} \mathbb{E}[X \mid A]\,\mathbf{1}_A.$$

Corollary 2.2.9. Let $(\mathcal{P}_n)$ be a sequence of finite partitions of $\Omega$ with $\sigma(\mathcal{P}_n) \subset \sigma(\mathcal{P}_{n+1}) \subset \mathcal{F}$ and set $X_n := \mathbb{E}[X \mid \sigma(\mathcal{P}_n)]$ as well as
$$\mathcal{G} := \sigma\Big(\bigcup_{n\in\mathbb{N}} \mathcal{P}_n\Big).$$
Then
$$X_n \to \mathbb{E}[X \mid \mathcal{G}] \quad \mathbb{P}\text{-a.s. and in } L^p.$$

Proof. Exercise.

This last corollary can also be used to transfer other results which we know for expectations (and hence also for conditional expectations as defined in (1.0.1)) to general conditional expectations – an example of this is Jensen's inequality. It turns out that under stronger assumptions on the martingales in question, the results of this section can be significantly improved (as one example, the convergence in Corollary 2.2.8 can be extended to convergence in $L^p$). For this purpose, we will introduce the concept of uniform integrability.

2.3 Uniform integrability

It will turn out that the concept of uniform integrability proves very helpful in the context of (convergence of) martingales. In order to be able to take full advantage of it, we first have to establish some basic properties, which are interesting in their own right. We will afterwards apply this machinery to our setting of martingales. To start with, observe that for a single random variable $X \in L^1$ the mapping
$$\mathcal{F} \ni F \mapsto \mathbb{E}[X\mathbf{1}_F] = \int_F X\,\mathrm{d}\mathbb{P}$$
defines a finite (signed) measure on $\mathcal{F}$. By definition, this measure is then absolutely continuous with respect to $\mathbb{P}$, and since $X \in L^1$ we obtain that for each $\varepsilon > 0$ there exists $M \in (0,\infty)$ such that $\int_{\{|X|\ge M\}} |X|\,\mathrm{d}\mathbb{P} \le \varepsilon$. The notion of uniform integrability introduced below is a generalisation of this property to families of random variables for which the respective $M$ can be chosen uniformly over all members of the family.

Definition 2.3.1. A family $\mathcal{S} \subset L^1(\Omega, \mathcal{F}, \mathbb{P})$ is called uniformly integrable (‘gleichgradig integrierbar’) if for every $\varepsilon > 0$ there exists $M > 0$ such that
$$\sup_{X\in\mathcal{S}} \int_{\{|X|\ge M\}} |X|\,\mathrm{d}\mathbb{P} \le \varepsilon.$$
The notion of uniform integrability can be extended to infinite measure spaces $(\Omega, \mathcal{F}, \mu)$ with $\mu(\Omega) = \infty$. Since this makes some of the reasonings slightly more technical, and since we will primarily be interested in probability spaces, we stick to the setting of probability spaces $(\Omega, \mathcal{F}, \mathbb{P})$ here.

Example 2.3.2. (a) If $(f_\lambda)$, $\lambda \in \Lambda$, with $|\Lambda| < \infty$ is a finite family of random variables with $f_\lambda \in L^1$ for all $\lambda \in \Lambda$, then the family $(f_\lambda)$ is uniformly integrable.

(b) If $\mathcal{S} \subset L^1(\Omega, \mathcal{F}, \mathbb{P})$ is a uniformly integrable family, then any subset $\mathcal{R} \subset \mathcal{S}$ is also a uniformly integrable family.

(c) If $\mathcal{R}, \mathcal{S} \subset L^1(\Omega, \mathcal{F}, \mathbb{P})$ are uniformly integrable families, then $\mathcal{R} \cup \mathcal{S}$ is also a uniformly integrable family.

(d) Let $\mathcal{S} \subset L^1(\Omega, \mathcal{F}, \mathbb{P})$ be such that there exists $g \in L^1(\Omega, \mathcal{F}, \mathbb{P})$ with
$$|f| \le g \qquad \forall f \in \mathcal{S}. \qquad (2.3.1)$$
Then $\mathcal{S}$ is uniformly integrable. Indeed, since $g \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, for each $\varepsilon > 0$ we find $M \in (0,\infty)$ such that
$$\int_{\{g\ge M\}} g\,\mathrm{d}\mathbb{P} \le \varepsilon.$$
Furthermore, from (2.3.1) we deduce
$$\sup_{f\in\mathcal{S}} \int_{\{|f|\ge M\}} |f|\,\mathrm{d}\mathbb{P} \le \int_{\{g\ge M\}} g\,\mathrm{d}\mathbb{P},$$
which in combination with the previous display gives us the desired uniform integrability.

(e) If for some $p \in (1,\infty)$ the family $(|f_\lambda|^p)$, $\lambda \in \Lambda$, is uniformly integrable, then, for any $q \in (1,p)$, so is the family $(|f_\lambda|^q)$, $\lambda \in \Lambda$.

Indeed, if $(|f_\lambda|^p)$, $\lambda \in \Lambda$, is uniformly integrable, then for any $\varepsilon > 0$ we can find $M \in (1,\infty)$ such that the last inequality of the following chain holds true:
$$\sup_{\lambda\in\Lambda} \int_{\{|f_\lambda|^q \ge M^{q/p}\}} |f_\lambda|^q\,\mathrm{d}\mathbb{P} \le \sup_{\lambda\in\Lambda} \int_{\{|f_\lambda|^p \ge M\}} (1 \vee |f_\lambda|^p)\,\mathrm{d}\mathbb{P} = \sup_{\lambda\in\Lambda} \int_{\{|f_\lambda|^p \ge M\}} |f_\lambda|^p\,\mathrm{d}\mathbb{P} \le \varepsilon,$$
where the equality follows from $M > 1$ (so that $|f_\lambda| > 1$ on the domain of integration), and this proves the claim.

(f) Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$. Then the family
$$\big\{\mathbb{E}[X \mid \mathcal{G}] : \mathcal{G} \text{ is a } \sigma\text{-algebra with } \mathcal{G} \subset \mathcal{F}\big\}$$
is uniformly integrable (easier to prove with the proof of Theorem 2.3.6 below).


Definition 2.3.3. For $p \in [1,\infty]$, we call a family $\mathcal{S} \subset L^p(\Omega, \mathcal{F}, \mathbb{P})$ ($L^p$-)bounded if
$$\sup_{X\in\mathcal{S}} \|X\|_p < \infty.$$

The following lemma can be seen as a generalisation, to uniformly integrable families of random variables, of the following absolute continuity property of integrable random variables: if $X \in L^1$, then for each $\varepsilon > 0$ there exists $\delta > 0$ such that for all $F \in \mathcal{F}$ with $\mathbb{P}(F) \le \delta$ one has $\mathbb{E}[|X|\mathbf{1}_F] \le \varepsilon$.

Lemma 2.3.4. A family $(X_n)$, $n \in T$, of random variables is uniformly integrable if and only if

(a) it is $L^1$-bounded, and

(b) for every $\varepsilon > 0$ there is $\delta > 0$ such that for all $A \in \mathcal{F}$ with $\mathbb{P}(A) < \delta$ one has
$$\mathbb{E}[|X_n| \cdot \mathbf{1}_A] < \varepsilon \qquad \forall n \in T.$$

Proof. If the family $(X_n)$ is uniformly integrable, then for any $\varepsilon > 0$ we have $M_\varepsilon \in (0,\infty)$ such that $\mathbb{E}[|X_n|\mathbf{1}_{\{|X_n|\ge M_\varepsilon\}}] \le \varepsilon$, so for each $n \in T$,
$$\mathbb{E}[|X_n|] \le \int_{\{|X_n|\ge M_\varepsilon\}} |X_n|\,\mathrm{d}\mathbb{P} + M_\varepsilon \le M_\varepsilon + \varepsilon,$$
which means that the family is $L^1$-bounded. In addition, for $\varepsilon > 0$ arbitrary, we have with $\delta := \frac{\varepsilon}{M_\varepsilon}$ that for $F \in \mathcal{F}$ with $\mathbb{P}(F) \le \delta$,
$$\mathbb{E}[|X_n| \cdot \mathbf{1}_F] = \mathbb{E}\big[|X_n| \cdot \mathbf{1}_F\mathbf{1}_{\{|X_n|\ge M_\varepsilon\}}\big] + \mathbb{E}\big[|X_n| \cdot \mathbf{1}_F\mathbf{1}_{\{|X_n|< M_\varepsilon\}}\big] \le \varepsilon + M_\varepsilon \cdot \delta = 2\varepsilon,$$
which shows part (b).

Conversely, assume that (a) and (b) are fulfilled. Fix $\varepsilon > 0$ arbitrarily and choose $\delta > 0$ accordingly as in (b). Then (a) implies $C := \sup_{n\in T} \mathbb{E}[|X_n|] < \infty$, and we can choose $M_\delta \in (0,\infty)$ such that the last inequality of
$$\mathbb{P}(|X_n| \ge M_\delta) \le \frac{\mathbb{E}[|X_n|]}{M_\delta} \le \frac{C}{M_\delta} \le \delta \qquad \forall n \in T$$
holds true. Thus, by part (b) we have
$$\mathbb{E}\big[|X_n| \cdot \mathbf{1}_{\{|X_n|\ge M_\delta\}}\big] \le \varepsilon \qquad \forall n \in T,$$
which shows the uniform integrability.

Exercise 2.3.5. (a) The sequence $X_n := n\,\mathbf{1}_{[0,\frac1n]}$ of random variables defined on $([0,1], \mathcal{B}([0,1]), \lambda|_{[0,1]})$ is $L^1$-bounded, but (b) of Lemma 2.3.4 is not fulfilled.

(b) Find an example of a family of random variables which fulfills part (b) of Lemma 2.3.4 but which does not fulfill part (a), and hence is not uniformly integrable.
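Part (a) is also instructive numerically. The following sketch (an illustration, not a solution to the exercise) shows that $\mathbb{E}[|X_n|] = 1$ for every $n$ while the mass concentrates on ever smaller sets – exactly the behaviour that uniform integrability rules out:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.random(1_000_000)        # uniform sample of [0, 1]
for n in (1, 10, 100, 1000):
    xn = n * (u <= 1 / n)        # X_n = n * 1_{[0, 1/n]}
    print(n, xn.mean())          # E|X_n| stays ~1 while P(X_n != 0) -> 0
```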

A main tool for showing the uniform integrability of a family of random variables is given bythe following characterisation.

Theorem 2.3.6. Let $\mathcal{S} \subset L^1(\Omega, \mathcal{F}, \mathbb{P})$. Then $\mathcal{S}$ is uniformly integrable if and only if there exists a function $\varphi : [0,\infty) \to [0,\infty)$ such that

(a) $$\frac{\varphi(x)}{x} \to \infty \quad \text{as } x \to \infty, \qquad (2.3.2)$$

(b) and
$$\sup_{X\in\mathcal{S}} \int_\Omega \varphi(|X|)\,\mathrm{d}\mathbb{P} < \infty. \qquad (2.3.3)$$

Furthermore, $\varphi$ can be chosen to be non-decreasing and convex.

Proof. If $\mathcal{S}$ is uniformly integrable, then for any $n \in \mathbb{N}$ there exists $M_n \in (0,\infty)$ such that
$$\sup_{X\in\mathcal{S}} \int_{\{|X|\ge M_n\}} |X|\,\mathrm{d}\mathbb{P} \le 2^{-n},$$
and without loss of generality we can and do choose the $M_n$ in such a way that $\lim_{n\to\infty} M_n = \infty$. A fortiori,
$$\sup_{X\in\mathcal{S}} \int (|X| - M_n)^+\,\mathrm{d}\mathbb{P} \le 2^{-n}. \qquad (2.3.4)$$
Setting
$$\varphi : [0,\infty) \ni x \mapsto \sum_{n\in\mathbb{N}} (x - M_n)^+ \in [0,\infty)$$
defines a non-decreasing function, which is also convex, since it is a sum of convex functions. In addition,
$$\frac{\varphi(x)}{x} = \sum_{n\in\mathbb{N}} (1 - M_n/x)^+ \to \infty \quad \text{as } x \to \infty,$$
so (2.3.2) holds true. In addition, we deduce using (2.3.4) in combination with the monotone convergence theorem that
$$\sup_{X\in\mathcal{S}} \int \varphi(|X|)\,\mathrm{d}\mathbb{P} \le \sum_{n\in\mathbb{N}} \sup_{X\in\mathcal{S}} \int (|X| - M_n)^+\,\mathrm{d}\mathbb{P} \le 1,$$
which establishes (2.3.3).

To prove the converse implication, assume that (2.3.2) and (2.3.3) hold true. These properties imply that for any $\varepsilon > 0$ there exists $M_\varepsilon \in (0,\infty)$ such that
$$\varphi(x) \ge x\,\frac{\sup_{X\in\mathcal{S}} \int_\Omega \varphi(|X|)\,\mathrm{d}\mathbb{P}}{\varepsilon} \qquad \text{for all } x \ge M_\varepsilon.$$
Hence, uniformly in $X \in \mathcal{S}$,
$$\int_{\{|X|\ge M_\varepsilon\}} |X|\,\mathrm{d}\mathbb{P} \le \frac{\varepsilon}{\sup_{X\in\mathcal{S}} \int_\Omega \varphi(|X|)\,\mathrm{d}\mathbb{P}} \int_{\{|X|\ge M_\varepsilon\}} \varphi(|X|)\,\mathrm{d}\mathbb{P} \le \varepsilon,$$
which shows the uniform integrability of $\mathcal{S}$.

Corollary 2.3.7. Let $p > 1$ and let $\mathcal{S} \subset L^p$ be an $L^p$-bounded family. Then for any $q \in [1,p)$, the family
$$\{|X|^q : X \in \mathcal{S}\}$$
is uniformly integrable.

Proof. We define $\varphi : [0,\infty) \to [0,\infty)$ via
$$\varphi(x) := x^{\frac{p}{q}}.$$
Then $\varphi(x)/x \to \infty$ as $x \to \infty$, and
$$\sup_{X\in\mathcal{S}} \int_\Omega \varphi(|X|^q)\,\mathrm{d}\mathbb{P} = \sup_{X\in\mathcal{S}} \int_\Omega |X|^p\,\mathrm{d}\mathbb{P} < \infty$$
due to the assumption that $\mathcal{S}$ is an $L^p$-bounded family. Via Theorem 2.3.6 this implies the result.


Exercise 2.3.8. Show that under the assumptions of Corollary 2.3.7, the family
$$\{|X|^p : X \in \mathcal{S}\}$$
is not necessarily uniformly integrable.

One of the reasons that the notion of uniform integrability is so powerful is the following result.

Theorem 2.3.9. Let $(X_n)$, $n \in \mathbb{N}$, and $X$ all be elements of $L^1(\Omega, \mathcal{F}, \mathbb{P})$. Assume furthermore that $X_n$ converges to $X$ in probability. Then the following are equivalent:

(a) The family $(X_n)$, $n \in \mathbb{N}$, is uniformly integrable.

(b) $X_n$ converges to $X$ in $L^1$.

(c) $\mathbb{E}[|X_n|] \to \mathbb{E}[|X|] < \infty$.

Remark 2.3.10. Note that the above describes a very special situation, since we have seen in ‘Probability Theory I’ that convergence in probability is generally weaker than convergence in $L^1$.

Proof of Theorem 2.3.9. ‘(a) $\Rightarrow$ (b)’: For notational convenience write $X_0 := X$. Then the family $(X_n)$, $n \in \mathbb{N}_0$, is uniformly integrable. Therefore, for $\varepsilon > 0$ arbitrary, Lemma 2.3.4 implies that we find $\delta > 0$ such that for all $F \in \mathcal{F}$ with $\mathbb{P}(F) \le \delta$ we have
$$\mathbb{E}[|X_n| \cdot \mathbf{1}_F] \le \varepsilon \qquad \forall n \in \mathbb{N}_0. \qquad (2.3.5)$$
Since $X_n$ converges to $X$ in probability as $n \to \infty$, we can find $N_\delta$ such that for all $n \ge N_\delta$ we have $\mathbb{P}(|X_n - X| \ge \varepsilon) \le \delta$. As a consequence, for such $n$ we deduce
$$\mathbb{E}[|X_n - X|] \le \mathbb{E}\big[|X_n - X| \cdot \mathbf{1}_{\{|X_n-X|\le\varepsilon\}}\big] + \mathbb{E}\big[|X_n - X| \cdot \mathbf{1}_{\{|X_n-X|\ge\varepsilon\}}\big] \le \varepsilon + \mathbb{E}\big[|X_n| \cdot \mathbf{1}_{\{|X_n-X|\ge\varepsilon\}}\big] + \mathbb{E}\big[|X| \cdot \mathbf{1}_{\{|X_n-X|\ge\varepsilon\}}\big],$$
and the last two summands are each upper bounded by $\varepsilon$ due to (2.3.5), which establishes (b).

‘(b) $\Rightarrow$ (c)’: Jensen's inequality implies the first, and the reverse triangle inequality the second, inequality of
$$\big|\mathbb{E}[|X_n|] - \mathbb{E}[|X|]\big| \le \mathbb{E}\big[\big||X_n| - |X|\big|\big] \le \mathbb{E}[|X_n - X|],$$
and the right-hand side converges to $0$ as $n \to \infty$ by assumption.

‘(c) $\Rightarrow$ (a)’: We set $X_0 := X$ and start by noting that the family $(X_n)$, $n \in \mathbb{N}_0$, is $L^1$-bounded due to (c). Now let $\varepsilon > 0$ be arbitrary. Then since $X \in L^1$, there exists $\delta \in (0,\varepsilon)$ such that for all $A$ with $\mathbb{P}(A) < \delta$ we have
$$\mathbb{E}[|X| \cdot \mathbf{1}_A] < \varepsilon. \qquad (2.3.6)$$
Furthermore, due to $X_n \xrightarrow{\mathbb{P}} X$ and (c), there exists $N_\delta \in \mathbb{N}$ such that, setting $G_{\delta,n} := \{|X_n - X| < \delta\}$, for all $n \ge N_\delta$ we have
$$\mathbb{P}(G_{\delta,n}^c) < \delta \qquad \text{and} \qquad \big|\mathbb{E}[|X_n|] - \mathbb{E}[|X|]\big| < \varepsilon. \qquad (2.3.7)$$
Hence, due to our choice of $\delta$,
$$\Big|\mathbb{E}\big[|X_n| \cdot \mathbf{1}_{G^c_{\delta,n}}\big] - \mathbb{E}\big[|X| \cdot \mathbf{1}_{G^c_{\delta,n}}\big]\Big| \le \big|\mathbb{E}[|X_n|] - \mathbb{E}[|X|]\big| + \underbrace{\big|\mathbb{E}\big[(|X_n| - |X|) \cdot \mathbf{1}_{G_{\delta,n}}\big]\big|}_{\le\,\delta} < 2\varepsilon, \qquad n \ge N_\delta, \qquad (2.3.8)$$
so in combination with (2.3.7) and (2.3.6),
$$\mathbb{E}\big[|X_n| \cdot \mathbf{1}_{G^c_{\delta,n}}\big] \le 3\varepsilon. \qquad (2.3.9)$$
Now for arbitrary $A \in \mathcal{F}$ with $\mathbb{P}(A) \le \delta$ we get that for $n \ge N_\delta$,
$$\mathbb{E}[|X_n| \cdot \mathbf{1}_A] \le \mathbb{E}\big[|X_n| \cdot \mathbf{1}_{G^c_{\delta,n}}\big] + \mathbb{E}\big[|X_n| \cdot \mathbf{1}_{G_{\delta,n}\cap A}\big], \qquad (2.3.10)$$
and by definition of $G_{\delta,n}$, the second summand on the right-hand side is upper bounded by
$$\mathbb{E}\big[|X_n| \cdot \mathbf{1}_{G_{\delta,n}\cap A}\big] \le \mathbb{E}[|X| \cdot \mathbf{1}_A] + \delta \le 2\varepsilon, \qquad (2.3.11)$$
again due to (2.3.6). Summarising (2.3.11), (2.3.10), and (2.3.9), we infer that for $n \ge N_\delta$,
$$\mathbb{E}[|X_n| \cdot \mathbf{1}_A] \le 5\varepsilon.$$
Since $\varepsilon > 0$ was chosen arbitrarily, and since, as observed above, the family $(X_n)$ is $L^1$-bounded, in combination with Lemma 2.3.4 this shows the desired uniform integrability.

Proposition 2.3.11. Let $(X_n)$ be a (sub-, super-)martingale with $X_n \to X_\infty$ in $L^1$ as $n \to \infty$. Then $X_n = \mathbb{E}[X_\infty \mid \mathcal{F}_n]$ (with ‘$=$’ replaced by ‘$\le$’ and ‘$\ge$’ in the case of a sub- and a supermartingale, respectively).

Proof. We give the proof in the case of $(X_n)$ being a martingale; the modifications in the case of a sub- or supermartingale are minor and left to the Reader. Note that on the one hand, for $F \in \mathcal{F}_n$ we have
$$\int_F X_n\,\mathrm{d}\mathbb{P} = \int_F X_k\,\mathrm{d}\mathbb{P} \qquad (2.3.12)$$
for all $k \ge n$ due to the martingale property $\mathbb{E}[X_k \mid \mathcal{F}_n] = X_n$. On the other hand, due to the $L^1$-convergence in the assumptions, we get
$$\lim_{k\to\infty} \int_F X_k\,\mathrm{d}\mathbb{P} = \int_F X_\infty\,\mathrm{d}\mathbb{P}.$$
Combining this with (2.3.12) and the fact that $F \in \mathcal{F}_n$ had been chosen arbitrarily, we infer that
$$X_n = \mathbb{E}[X_\infty \mid \mathcal{F}_n].$$

Example 2.3.12. (a) Recall that we still had to prove (2.2.5); with Theorem 2.3.9 at our disposal, this boils down to a piece of cake: indeed, Example 2.3.2 (f) tells us that the family of random variables $(\mathbb{E}[X \mid \mathcal{F}_n])$ is uniformly integrable. But in the partial proof of Corollary 2.2.8 we had shown that $(\mathbb{E}[X \mid \mathcal{F}_n])$ converges $\mathbb{P}$-a.s. to some limit $X_\infty$, and hence Theorem 2.3.9 implies that it converges to $X_\infty$ in $L^1$ as well. Now the characterisation of the limit follows from Proposition 2.3.11.

(b) Now that we have completed the proof of Corollary 2.2.8, we show how it provides us with an easy proof of Kolmogorov's $0$–$1$ law (cf. Theorem 3.2.16 in [Dre19]). Indeed, let $(X_n)$ be an independent sequence of random variables, $\mathcal{G}_n := \sigma(X_n)$, and let $\mathcal{F}_n := \sigma(\bigcup_{i=1}^n \mathcal{G}_i)$. Then for $F$ in the tail-$\sigma$-algebra, i.e.,
$$F \in \mathcal{T} := \bigcap_{n=1}^\infty \sigma\Big(\bigcup_{m=n}^\infty \mathcal{G}_m\Big),$$
we have $\mathbb{P}(F) \in \{0,1\}$.

Indeed, since $F$ is independent of $\mathcal{F}_n$ for any $n \in \mathbb{N}$, we have that $\mathbb{P}(F \mid \mathcal{F}_n) = \mathbb{P}(F)$. On the other hand, Corollary 2.2.8 implies that $\mathbb{P}$-a.s.,
$$\mathbb{P}(F \mid \mathcal{F}_n) \to \mathbb{P}(F \mid \mathcal{F}_\infty), \qquad n \to \infty,$$
and the right-hand side equals $\mathbf{1}_F$ since $\mathcal{T} \subset \mathcal{F}_\infty$, so we have $\mathbb{P}(\mathbb{P}(F) \ne \mathbf{1}_F) = 0$, which means that $\mathbb{P}(F) \in \{0,1\}$.

Exercise 2.3.13. Derive the following version of the dominated convergence theorem using Theorem 2.3.9:

Theorem. Let $X$, $X_n$, $n \in \mathbb{N}$, be real random variables such that $X_n$ converges to $X$ in probability, and assume there is $Y \in L^1$ such that for all $n \in \mathbb{N}$ we have $|X_n| \le Y$ $\mathbb{P}$-a.s. Then $X \in L^1$, and $X_n$ converges to $X$ in $L^1$, so in particular
$$\int X_n\,\mathrm{d}\mathbb{P} \to \int X\,\mathrm{d}\mathbb{P} \quad \text{as } n \to \infty.$$

In fact, if one introduces the notion of uniform integrability for general measure spaces and develops the analogous machinery for uniformly integrable families in this setting, one can deduce the general version of the dominated convergence theorem from the result corresponding to Theorem 2.3.9.

2.3.1 Convergence theorems for uniformly integrable martingales

As we have seen before, if a sequence of random variables is uniformly integrable, this can add to its convergence properties. In particular, this is true for martingales, as we will see below. As a first observation, if $(M_n)$ is a uniformly integrable martingale, then Lemma 2.3.4 implies that the family $(M_n)$ is bounded in $L^1$, and hence $(M_n)$ converges $\mathbb{P}$-a.s. as $n \to \infty$ due to Theorem 2.2.2. We can then use Theorem 2.3.9 to deduce further properties, and this is summarised in the following result.

Theorem 2.3.14. Let $(M_n)$, $n \in \mathbb{N}_0$, be a submartingale. Then the following are equivalent:

(a) the family $(M_n)$, $n \in \mathbb{N}_0$, is uniformly integrable;

(b) the sequence $(M_n)$, $n \in \mathbb{N}_0$, converges $\mathbb{P}$-a.s. and in $L^1$ to some $M_\infty \in L^1$;

(c) the sequence $(M_n)$, $n \in \mathbb{N}_0$, converges in $L^1$ to some $M_\infty \in L^1$.

Proof. ‘(a) $\Rightarrow$ (b)’: Since $(M_n)$ is a uniformly integrable submartingale, Lemma 2.3.4 implies that the family $(M_n)$ is bounded in $L^1$, and hence $(M_n)$ converges $\mathbb{P}$-a.s. as $n \to \infty$ to some $M_\infty \in L^1$ due to Theorem 2.2.2. Theorem 2.3.9 implies that this convergence also holds in $L^1$.

‘(b) $\Rightarrow$ (c)’: This is evident.

‘(c) $\Rightarrow$ (a)’: This is a direct consequence of Theorem 2.3.9.

By the usual reasoning the previous theorem also applies to martingales and supermartingales.

Theorem 2.3.15. If one of the equivalent conditions in Theorem 2.3.14 is fulfilled, then there exists $M_\infty \in L^1$ such that
$$M_n = \mathbb{E}[M_\infty \mid \mathcal{F}_n] \qquad \forall n \in \mathbb{N}$$
(with ‘$=$’ replaced by ‘$\le$’ and ‘$\ge$’ in the case of $(M_n)$ being a submartingale and supermartingale, respectively).

Proof. This follows from Proposition 2.3.11.


Compare the previous theorem to Example 2.0.8, where we had shown that for any random variable $X$ and a filtration $(\mathcal{F}_n)$, the sequence $(\mathbb{E}[X \mid \mathcal{F}_n])$ supplies us with a martingale. The last item of the above is a converse to this, and a martingale with this property is also called right-closable (which, due to the above, is the same as the martingale being uniformly integrable).

In the case where a martingale is even $L^p$-bounded, $p \in (1,\infty)$, we can strengthen the convergence given above.

Theorem 2.3.16. Let $p \in (1,\infty)$, and let $(M_n)$ be a martingale which is $L^p$-bounded. Then there exists $M_\infty \in L^p$ such that $M_n \to M_\infty$ $\mathbb{P}$-a.s. and in $L^p$. In particular, the sequence $(|M_n|^p)$ is uniformly integrable.

Proof. Since
$$\sup_{n\in T} \mathbb{E}[|M_n|] \le \sup_{n\in T} \mathbb{E}[|M_n|^p]^{\frac1p} < \infty$$
because $(M_n)$ is $L^p$-bounded, our plain vanilla martingale convergence theorem, Theorem 2.2.2, implies that $(M_n)$ converges $\mathbb{P}$-a.s. to some $M_\infty \in L^1$.

Doob's $L^p$-inequality (Theorem 2.1.23) states that
$$\mathbb{E}[(|M|^*_n)^p] \le \Big(\frac{p}{p-1}\Big)^p\,\mathbb{E}[|M_n|^p],$$
and since the right-hand side is bounded in $n$ by assumption, we deduce, taking $n \to \infty$ on the left-hand side, that $M_\infty \in L^p$. Furthermore, since we have
$$|M_n - M_\infty| \le 2\lim_{n\to\infty} |M|^*_n,$$
and since the right-hand side is in $L^p$ due to the previous, we deduce by the DCT and $M_n \to M_\infty$ $\mathbb{P}$-a.s. that $M_n \to M_\infty$ in $L^p$ also.

The stated uniform integrability immediately follows from
$$|M_n| \le \lim_{n\to\infty} |M|^*_n \in L^p.$$

Example 2.3.17. If we investigate our favourite martingale, simple random walk $(S_n)$, in the light of convergence, then, since $|S_{n+1} - S_n| = 1$, we know that it does not converge $\mathbb{P}$-a.s., and this is consistent with the above since $(S_n)$ is not uniformly integrable.

2.3.2 Square integrable martingales

As we have seen before in the context of conditional expectations, square integrability can nicely bring in geometry. This is also one of the reasons why square integrable martingales deserve some special attention. Indeed, in the words of Williams [Wil91]: ‘When it works, one of the easiest ways of proving that a martingale $M$ is bounded in $L^1$ is to prove that it is bounded in $L^2$...’

Another reason is that square integrable martingales naturally have finite second moments, which is necessary if one hopes for a central limit theorem to hold at all. As it will turn out, we can actually prove such a central limit theorem, which is yet another manifestation that martingales can be considered as a generalisation of sums of i.i.d. variables.⁴

⁴Indeed, in [Dre19] we had only studied a central limit theorem for i.i.d. random variables. There are, however, pretty general versions of central limit theorems for sequences of dependent random variables also (which are not totally easy to prove), see e.g. Section 7.7 of [Dur10]. We will focus here on the case of martingales since it fits our studies and still comes along relatively simply, but already with some important ideas.


Theorem 2.3.18. If $(M_n)$ is a square integrable martingale with $M_0 = 0$, then
$$\mathbb{E}\Big[\sup_{n\in\mathbb{N}_0} M_n^2\Big] \le 4\,\mathbb{E}[\langle M\rangle_\infty].$$

Proof. For $p = 2$, Doob's $L^p$-inequality (Theorem 2.1.23) yields
$$\mathbb{E}\Big[\sup_{k\in\{0,\dots,n\}} M_k^2\Big] \le 4\,\mathbb{E}[M_n^2] = 4\,\mathbb{E}[\langle M\rangle_n],$$
and taking $n \to \infty$ on both sides as well as applying the monotone convergence theorem, we obtain the desired result.

Corollary 2.3.19. In the setting of Theorem 2.3.18, $\mathbb{P}$-a.s. on the set $\{\langle M\rangle_\infty < \infty\}$ the limit $\lim_{n\to\infty} M_n$ exists and is finite.

Proof. Again denoting by $(\langle M\rangle_n)$ the square variation process of $(M_n)$, we get for $K \in (0,\infty)$ that $\tau_K := \inf\{n \in \mathbb{N} : \langle M\rangle_{n+1} > K\}$ is a stopping time (since $(\langle M\rangle_n)$ is a predictable process). Hence, the optional sampling theorem yields that $(M_{\tau_K\wedge n})$ is a (in fact square integrable) martingale (with square variation $\langle M_{\tau_K\wedge\cdot}\rangle_n = \langle M\rangle_{\tau_K\wedge n} \le K$ for all $n \in \mathbb{N}$), and thus Theorem 2.3.18 implies that
$$\mathbb{E}\Big[\sup_{k\in\{0,\dots,n\}} M_{\tau_K\wedge k}^2\Big] \le 4K.$$
Thus, Theorem 2.3.16 implies that $\lim_{n\to\infty} M_{\tau_K\wedge n}$ exists and is finite $\mathbb{P}$-a.s. Taking $K \to \infty$, the result follows.

Corollary 2.3.20. Let $(M_n)$ be a square integrable martingale with square variation process $(\langle M\rangle_n)$. The following are equivalent:

(a) $\lim_{n\to\infty} \mathbb{E}[\langle M\rangle_n] < \infty$;

(b) $\sum_{i=1}^\infty \mathbb{E}[(M_{i+1} - M_i)^2] < \infty$;

(c) $\sup_{n\in\mathbb{N}} \mathbb{E}[M_n^2] < \infty$;

(d) $(M_n)$ converges in $L^2$;

(e) $(M_n)$ converges $\mathbb{P}$-a.s. and in $L^2$.

Proof. The equivalence of (a), (b), and (c) follows from (2.0.12). The implications (e) $\Rightarrow$ (d) $\Rightarrow$ (c) are obvious. The implication (c) $\Rightarrow$ (e) follows from Theorem 2.3.16.

2.3.3 A central limit theorem for martingales

Here, as in other contexts, it turns out to be conceptually useful to consider the increments of a martingale (also called martingale differences) instead of the martingale itself. This is captured in the following definition, and it also underlines the interpretation of a martingale as a generalisation of sums of independent random variables.


Definition 2.3.21. If $(X_n)$, $n \in \mathbb{N}$, is a stochastic process with $X_n \in L^1$, adapted to $(\mathcal{F}_n)$, then it is called a martingale difference sequence if $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = 0$ for all $n \in \mathbb{N}$.

If for each $n \in \mathbb{N}$ we have a (finite) sequence of random variables $(X_{n,k})$, $1 \le k \le m_n$, some $m_n \in \mathbb{N}$, with $X_{n,k} \in L^1$, and a corresponding filtration $(\mathcal{F}_{n,k})$, $1 \le k \le m_n$, such that for all $n \in \mathbb{N}$ and all $k \in \{1,\dots,m_n\}$ the equation
$$\mathbb{E}[X_{n,k} \mid \mathcal{F}_{n,k-1}] = 0$$
holds true (with $\mathcal{F}_{n,0} := \{\Omega, \emptyset\}$), then we call the pair $(X_{n,k})$, $(\mathcal{F}_{n,k})$, $1 \le k \le m_n$, a martingale difference array.

In particular, if $(M_n)$ is a martingale with respect to $(\mathcal{F}_n)$, then $X_n := M_n - M_{n-1}$ is a martingale difference sequence. Hence, in our heuristic correspondence of martingales to generalisations of sums of independent random variables, martingale differences correspond to the independent random variables themselves.

Theorem 2.3.22. Let $(X_{n,k})$, $1 \le k \le m_n$, some $m_n \in \mathbb{N}$, and $(\mathcal{F}_{n,k})$, $1 \le k \le m_n$, as above be a martingale difference array, and set
$$S_{n,k} := \sum_{i=1}^k X_{n,i}, \qquad n \in \mathbb{N},\ k \in \{0,1,\dots,m_n\}.$$
If
$$\mathbb{E}\Big[\max_{i\in\{1,\dots,m_n\}} |X_{n,i}|\Big] \to 0 \quad \text{as } n \to \infty, \qquad (2.3.13)$$
and if in addition for some $\sigma^2 \in (0,\infty)$ we have that
$$\sum_{i=1}^{m_n} X_{n,i}^2 \to \sigma^2 \quad \text{in probability as } n \to \infty, \qquad (2.3.14)$$
then
$$S_{n,m_n} \xrightarrow{\mathcal{L}} \mathcal{N}(0,\sigma^2) \quad \text{as } n \to \infty.$$

Remark 2.3.23. In order to put this result into context, we compare it to the standard CLT we know from ‘Probability Theory I’, see Theorem 3.8.1 of [Dre19] (with $d = 1$ for the sake of simplicity).

Indeed, for $(X_n)$ a sequence of i.i.d. random variables with $\sigma^2 := \mathrm{Var}(X_1) \in (0,\infty)$, [Dre19, Theorem 3.8.1] states that
$$\frac{\sum_{i=1}^n (X_i - \mathbb{E}[X_i])}{\sqrt n}$$
converges in distribution to a $\mathcal{N}(0,\sigma^2)$-distributed random variable.

This can be recovered from the previous result by setting $m_n := n$, $X_{n,i} := (X_i - \mathbb{E}[X_i])/\sqrt n$, $1 \le i \le n$, and $\mathcal{F}_{n,k} := \sigma(X_1,\dots,X_k)$ for any $1 \le k \le n$. It is easy to check that this actually defines a martingale difference array, and furthermore, assuming for notational simplicity that the $X_i$ are centred, we have
$$\mathbb{E}\Big[\max_{i\in\{1,\dots,n\}} |X_{n,i}|\Big] = \int_0^\infty \mathbb{P}\Big(\max_{i\in\{1,\dots,n\}} X_i^2 \ge ns^2\Big)\,\mathrm{d}s = \int_0^\infty 1 - \mathbb{P}(X_1^2 < ns^2)^n\,\mathrm{d}s.$$
But since $X_1 \in L^2$, we have
$$\mathbb{E}[X_1^2] = \int_0^\infty \mathbb{P}(X_1^2 \ge s)\,\mathrm{d}s < \infty,$$
whence the integral test of convergence (‘Integralkriterium’) yields $\mathbb{P}(X_1^2 \ge s) \in o\big((s\ln s)^{-1}\big)$, so, continuing the above, we get
$$\mathbb{E}\Big[\max_{i\in\{1,\dots,n\}} |X_{n,i}|\Big] \le \int_0^\infty 1 - \Big(1 - \frac{1}{1 \vee n(s^2\ln(ns^2))}\Big)^n\,\mathrm{d}s \le \int_0^\infty \frac{1}{1 \vee (s^2\ln(ns^2))}\,\mathrm{d}s \to 0 \quad \text{as } n \to \infty.$$
Furthermore, the (weak) law of large numbers yields
$$\sum_{i=1}^n X_{n,i}^2 = \frac1n \sum_{i=1}^n (X_i - \mathbb{E}[X_i])^2 \to \mathrm{Var}(X_1) = \sigma^2 \quad \text{in probability as } n \to \infty,$$
which establishes (2.3.14).
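For illustration (not from the notes; all names and parameters are arbitrary), here is a martingale difference array that is not a rescaled i.i.d. scheme: $X_{n,i} = \varepsilon_i(1 + \varepsilon_{i-1}/2)/\sqrt n$ with i.i.d. signs $\varepsilon_i$. Conditions (2.3.13) and (2.3.14) hold with $\sigma^2 = \mathbb{E}[(1+\varepsilon/2)^2] = 5/4$, and the simulated row sums indeed have variance close to $5/4$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2_000, 2_000
eps = rng.choice((-1.0, 1.0), size=(reps, n + 1))
# X_{n,i} = eps_i * (1 + eps_{i-1}/2) / sqrt(n): a genuine martingale
# difference array (the factor depends on the past), sigma^2 = 5/4.
incr = eps[:, 1:] * (1 + eps[:, :-1] / 2) / np.sqrt(n)
sums = incr.sum(axis=1)
print(sums.mean(), sums.var())  # mean near 0, variance near 1.25
```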

Before we can actually prove this result we need the following lemma.

Lemma 2.3.24. Let $(X_n)$ and $(Y_n)$ be two sequences of random variables such that the following three conditions hold true:

(a) there exists a constant $c \in \mathbb{R}$ such that $X_n \xrightarrow{\mathbb{P}} c$;

(b) the sequences $(Y_n)$ and $(X_n Y_n)$ are both uniformly integrable;

(c) $\mathbb{E}[Y_n] \to 1$ as $n \to \infty$.

Then $\mathbb{E}[X_n Y_n] \to c$ as $n \to \infty$.

Proof of Theorem 2.3.22. Dividing all elements of the array by $\sqrt{\sigma^2}$, we can assume $\sigma^2 = 1$ without loss of generality. In order to obtain a good quality of the Taylor expansion below, in a standard cutoff procedure we replace the original array by the random variables
$$\overline X_{n,1} := X_{n,1}, \qquad \overline X_{n,k} := X_{n,k}\,\mathbf{1}_{\{\sum_{i=1}^{k-1} X_{n,i}^2 \le 2\}}, \quad k \in \{2,\dots,m_n\}.$$
Then $(\overline X_{n,k})$ still constitutes a martingale difference array, since we have that
$$\Big\{\sum_{i=1}^{k-1} X_{n,i}^2 \le 2\Big\} \in \mathcal{F}_{n,k-1}.$$
Introducing the stopping times
$$\tau_n := \inf\Big\{k \in \mathbb{N} : \sum_{i=1}^k X_{n,i}^2 > 2\Big\} \wedge m_n,$$
we infer that
$$\mathbb{P}\big(\overline X_{n,i} \ne X_{n,i} \text{ for some } i \in \{1,\dots,m_n\}\big) = \mathbb{P}(\tau_n < m_n) \le \mathbb{P}\Big(\sum_{i=1}^{m_n} X_{n,i}^2 > 2\Big) \to 0 \quad \text{as } n \to \infty,$$
where the convergence follows in combination with (2.3.14). Hence it is sufficient to prove that
$$\overline S_n := \sum_{i=1}^{m_n} \overline X_{n,i}$$
converges in distribution to $\mathcal{N}(0,1)$. Now recall that due to Theorem 3.8.10 of [Dre19], for this purpose it is sufficient to show that
$$\mathbb{E}[e^{it\overline S_n}] \to e^{-t^2/2} \quad \text{as } n \to \infty.$$
It turns out that a suitable way to write the Taylor expansion of the exponential function around $0$ in this context is given by
$$e^{ix} = (1 + ix)\,e^{-\frac{x^2}{2} + \varepsilon(x)},$$
where $\varepsilon$ describes the error and satisfies
$$|\varepsilon(x)| \le C|x|^3 \qquad (2.3.15)$$
for some constant $C \in (0,\infty)$ and all $|x|$ small enough. As a consequence,
$$e^{it\overline S_n} = \underbrace{\Big(\prod_{i=1}^{m_n} (1 + it\overline X_{n,i})\Big)}_{=:Y_n}\;\underbrace{\exp\Big\{-\frac{t^2}{2}\sum_{i=1}^{m_n} \overline X_{n,i}^2 + \sum_{i=1}^{m_n} \varepsilon(t\overline X_{n,i})\Big\}}_{=:X_n}. \qquad (2.3.16)$$
We now show that the assumptions of Lemma 2.3.24 hold true. Indeed, we have
$$\Big|\sum_{i=1}^{m_n} \varepsilon(t\overline X_{n,i})\Big| \le C|t|^3 \sum_{i=1}^{m_n} |\overline X_{n,i}|^3 \le C|t|^3 \max_{i\in\{1,\dots,m_n\}} |\overline X_{n,i}| \sum_{i=1}^{m_n} |\overline X_{n,i}|^2 \xrightarrow{\mathbb{P}} 0,$$
where in the first inequality we used (2.3.15), and in the convergence we took advantage of (2.3.14) and (2.3.13), where the latter implied that for $\varepsilon > 0$ we have
$$\mathbb{P}\Big(\max_{i\in\{1,\dots,m_n\}} |\overline X_{n,i}| \ge \varepsilon\Big) \le \varepsilon^{-1}\,\mathbb{E}\Big[\max_{i\in\{1,\dots,m_n\}} |X_{n,i}|\Big] \to 0 \quad \text{as } n \to \infty.$$
Using (2.3.14) again, this implies
$$X_n \xrightarrow{\mathbb{P}} e^{-\frac{t^2}{2}},$$
and hence assumption (a) of Lemma 2.3.24 is fulfilled with $c = e^{-\frac{t^2}{2}}$.

Now by definition we have $|\overline X_{n,k}| \le 2$ for all $n \in \mathbb{N}$, $k \in \{1,\dots,m_n\}$, and so we deduce, in combination with the tower property, the fact that we are dealing with a martingale difference sequence, and Theorem 1.0.9 (e), that
$$\mathbb{E}[Y_n] = \mathbb{E}\Big[\mathbb{E}\Big[(1 + it\overline X_{n,m_n})\prod_{i=1}^{m_n-1}(1 + it\overline X_{n,i})\,\Big|\,\mathcal{F}_{n,m_n-1}\Big]\Big] = \mathbb{E}\Big[\prod_{i=1}^{m_n-1}(1 + it\overline X_{n,i})\,\underbrace{\mathbb{E}\big[1 + it\overline X_{n,m_n}\,\big|\,\mathcal{F}_{n,m_n-1}\big]}_{=1}\Big] = \mathbb{E}\Big[\prod_{i=1}^{m_n-1}(1 + it\overline X_{n,i})\Big] = \dots = 1.$$
In particular, this means that (c) of Lemma 2.3.24 holds true.

Also, note that the sequence $(X_n Y_n)$ is uniformly integrable since we have $|X_n Y_n| = 1$ for all $n \in \mathbb{N}$ due to (2.3.16). Therefore, the only thing to check in order to apply Lemma 2.3.24 is the uniform integrability of the sequence $(Y_n)$. Note that due to
$$|1 + ix|^2 = 1 + x^2 \le e^{x^2}$$
we infer, taking square roots on both sides, that
$$|Y_n| = \prod_{i=1}^{\tau_n} |1 + it\overline X_{n,i}| \le \exp\Big\{\frac{t^2 \sum_{i=1}^{\tau_n-1} \overline X_{n,i}^2}{2}\Big\}\,\big(1 + |t\overline X_{n,\tau_n}|\big) \le e^{t^2}\Big(1 + |t| \max_{i\in\{1,\dots,m_n\}} |\overline X_{n,i}|\Big).$$
But now note that the sequence $(\max_{j\in\{1,\dots,m_n\}} |\overline X_{n,j}|)$ is uniformly integrable due to Theorem 2.3.9, since it converges to $0$ in $L^1$, and hence the same applies to the right-hand side and thus to the left-hand side of the previous display.

Remark 2.3.25. In exactly the same way as Example 3.8.12 of [Dre19], the above Theorem 2.3.22 implies that $S_{n,m_n}/\sqrt{m_n}$ converges in probability to $0$, which can be interpreted as a weak law of large numbers type result.

Since it makes a better fit there and will crop up quite naturally, we will see applications of Theorem 2.3.22 in Chapter 4 below.

2.4 Backwards martingales

The martingale convergence results we have investigated so far corresponded to more and more information becoming available. In particular, on a heuristic level this means that the case of a martingale converging to a constant is the exception rather than the rule. However, if you recall fundamental results such as the law of large numbers, it becomes evident that convergence to constants also plays a significant role in probability theory, and this is in some sense better captured by the backwards martingales defined below.

The general definition of a martingale given in Definition 2.0.6 did not presume the index set to be the natural numbers. In fact, one of the reasons for keeping things general over there is becoming apparent in this section.

Definition 2.4.1. If $(M_n)$, $n \in -\mathbb{N}_0$ (or $n \in -\mathbb{N}$), is a martingale with respect to a filtration $(\mathcal{F}_n)$, $n \in -\mathbb{N}_0$ (or $n \in -\mathbb{N}$), then $(M_n)$ is also called a backwards martingale (also: reversed martingale) (‘Rückwärtsmartingal’). Similarly, if it is a sub- or supermartingale, it is called a backwards sub- or supermartingale.

Note that the ‘right-closedness’ property we had established for (uniformly integrable) martingales holds by definition (i.e., due to the martingale property) for backwards martingales: $M_n = \mathbb{E}[M_0 \mid \mathcal{F}_n]$ for all $n \in -\mathbb{N}_0$. Equivalently, cf. Example 2.3.2 (f), we observe that backwards martingales are always uniformly integrable and hence in general as well-behaved as only the nicest martingales can be. This will also manifest itself in the convergence theorem below as well as in its applications.

For a filtration $(\mathcal{F}_n)$, $n \in -\mathbb{N}_0$, we define, analogously to $\mathcal{F}_\infty$, the $\sigma$-algebra
$$\mathcal{F}_{-\infty} := \bigcap_{n\in-\mathbb{N}_0} \mathcal{F}_n = \lim_{n\to-\infty} \mathcal{F}_n.$$

We now directly dive into our main convergence result for backwards martingales, the proof of which also takes advantage of our basic upcrossing lemma, Lemma 2.2.1.

Theorem 2.4.2. Let $(M_n)$, $n \in -\mathbb{N}_0$, be a backwards martingale. Then
$$M_n \to M_{-\infty} = \mathbb{E}[M_0 \mid \mathcal{F}_{-\infty}] \quad \text{in } L^1 \text{ and } \mathbb{P}\text{-a.s. as } n \to -\infty.$$
Similarly, if $(M_n)$, $n \in -\mathbb{N}_0$, is a backwards submartingale with $\inf_{n\in-\mathbb{N}_0} \mathbb{E}[M_n] > -\infty$, then $(M_n)$, $n \in -\mathbb{N}_0$, is a uniformly integrable family,
$$M_{-\infty} := \lim_{n\to-\infty} M_n \quad \text{exists in } L^1 \text{ and } \mathbb{P}\text{-a.s.},$$
and
$$M_{-\infty} \le \mathbb{E}[M_n \mid \mathcal{F}_{-\infty}] \qquad \forall n \in -\mathbb{N}_0.$$

Proof. We give the proof in the case of a martingale and leave the case of a submartingale to the Reader. The proof proceeds quite similarly to the proof of the standard martingale convergence theorem. Denoting for $n \in -\mathbb{N}_0$ and $a, b \in \mathbb{R}$ with $a < b$ by $U_n[a,b]$ the number of upcrossings of $M_n, M_{n+1}, \dots, M_0$ over the interval $[a,b]$, similarly to Section 2.2, Lemma 2.2.1 provides us with
$$\mathbb{E}\big[U_n[a,b]\big] \le \frac{\mathbb{E}[(M_0-a)^+]}{b-a}.$$
Defining the monotone non-decreasing limit $U_\infty[a,b] := \lim_{n\to-\infty} U_n[a,b]$, we infer from the previous display in combination with the monotone convergence theorem that
$$\mathbb{E}\big[U_\infty[a,b]\big] \le \frac{\mathbb{E}[(M_0-a)^+]}{b-a}.$$
As in the proof of Theorem 2.2.2 we therefore conclude that the limit $M_{-\infty} := \lim_{n\to-\infty} M_n$ exists $\mathbb{P}$-a.s. in $[-\infty,\infty]$. Since the family $(M_n)$, $n \in -\mathbb{N}_0$, is uniformly integrable due to Example 2.3.2 (f), and since $M_{-\infty} \in L^1$ (check!), we deduce that $M_n \to M_{-\infty}$ in $L^1$ as well, due to Theorem 2.3.9. In order to show that
$$M_{-\infty} = \mathbb{E}[M_0 \mid \mathcal{F}_{-\infty}], \qquad (2.4.1)$$
as before, for $F \in \mathcal{F}_{-\infty}$ we have, due to $\mathcal{F}_{-\infty} \subset \mathcal{F}_n$, that
$$\mathbb{E}[M_0\mathbf{1}_F] = \mathbb{E}[M_n\mathbf{1}_F] \qquad \forall n \in -\mathbb{N}_0,$$
and taking $n \to -\infty$ this becomes $\mathbb{E}[M_0\mathbf{1}_F] = \mathbb{E}[M_{-\infty}\mathbf{1}_F]$ due to uniform integrability and Theorem 2.3.9, which by definition of the conditional expectation shows (2.4.1).

Exercise 2.4.3. Show that if, in the setting of the previous theorem, $M_0 \in L^p$ for some $p \in (1,\infty)$, then $M_n \to M_{-\infty}$ even in $L^p$ as $n \to -\infty$.

The above convergence result is sort of a converse to what we had seen before in martingale convergence: indeed, taking $n \to -\infty$ corresponds to having less and less information available. We will give another example of an application of convergence of backwards martingales here.

Example 2.4.4 (Ballot theorem). Consider $(X_n)$, $1 \le n \le N \in \mathbb{N}$, a finite sequence of $\mathbb{N}_0$-valued i.i.d. random variables and set $S_n := \sum_{i=1}^n X_i$. The claim is that
$$\mathbb{P}\big(\underbrace{S_n < n\ \forall n \in \{1,\dots,N\}}_{=:G}\,\big|\,S_N\big) = \Big(1 - \frac{S_N}{N}\Big)^+. \qquad (2.4.2)$$
To motivate the name of this problem, consider the case where $N$ voters each consecutively have to cast a vote for one of two candidates $A$ and $B$, and where the event $\{X_n = 2\}$ corresponds to the $n$-th voter casting her vote for candidate $A$, and the event $\{X_n = 0\}$ corresponds to the $n$-th voter casting her vote for candidate $B$. Then candidate $B$ leads at time $n$ if and only if $S_n < n$, so the event in the probability of (2.4.2) corresponds to $B$ having led throughout the entire voting process from the first to the $N$-th vote.


On the event $\{S_N \ge N\}$ the statement of (2.4.2) trivially holds true, so assume $S_N < N$ from now on. The crucial observation here is that, defining for $n \in \{1,\dots,N\}$ the random variables $M_{-n} := S_n/n$, they can be interpreted as a backwards martingale with respect to the filtration characterised by $\mathcal{F}_{-n} := \sigma(S_n, S_{n+1}, \dots, S_N)$. Indeed, we start by noting that $\mathcal{F}_{-n} = \sigma(S_n, X_{n+1}, \dots, X_N)$, which in combination with the fact that the $(X_i)$ are i.i.d. implies that for $1 \le i \le n$ we have $\mathbb{E}[X_i \mid \mathcal{F}_{-n}] = \mathbb{E}[X_1 \mid \sigma(S_n)] = \frac{S_n}{n}$. As a consequence,
$$\mathbb{E}[M_{-n+1} \mid \mathcal{F}_{-n}] = \mathbb{E}\Big[\frac{S_{n-1}}{n-1}\,\Big|\,\mathcal{F}_{-n}\Big] = \mathbb{E}\Big[\frac{S_n}{n-1} - \frac{X_n}{n-1}\,\Big|\,\mathcal{F}_{-n}\Big] = \frac{S_n}{n-1} - \frac{S_n}{n(n-1)} = \frac{S_n}{n} = M_{-n},$$
which establishes the martingale property.

Furthermore, we claim that setting
$$\tau := \begin{cases} \max\{n \in \{1,\dots,N\} : S_n \ge n\}, & \text{if the set on the RHS is non-empty},\\ 1, & \text{otherwise}, \end{cases}$$
we have that $-\tau$ is a stopping time with respect to $(\mathcal{F}_{-n})$. Indeed, we have
$$\{-\tau = -n\} = \{\tau = n\} = \big\{S_n \ge n \text{ and } S_k < k\ \forall k \in \{n+1,\dots,N\}\big\} \in \mathcal{F}_{-n}$$
for $n \in \{2,\dots,N\}$, and the case $n = 1$ can be proven separately by a distinction of cases regarding the definition of $\tau$; this shows the stopping time property of $-\tau$.

Therefore, the optional sampling theorem yields
$$\mathbb{E}[M_{-\tau} \mid \mathcal{F}_{-N}] = M_{-N} = S_N/N. \qquad (2.4.3)$$
We now claim that $M_{-\tau} = \mathbf{1}_{G^c}$. To check this, note first that if $S_{j+1} < j+1$, then the fact that the $X_i$ are non-negative integers implies $S_j \le S_{j+1} \le j$. On $G^c$ we have $\tau < N$ (recall $S_N < N$) and $S_{\tau+1} < \tau+1$, so the previous observation combined with $S_\tau \ge \tau$ yields $S_\tau = \tau$, i.e. $M_{-\tau} = 1$ on $G^c$. Furthermore, since $G \subset \{-\tau = -1\}$, and since $S_1 < 1$ implies $S_1 = 0$, we have that $M_{-\tau} = 0$ on $G$. Combining the two we get
$$M_{-\tau} = \mathbf{1}_{G^c}.$$
Hence, in combination with $\mathcal{F}_{-N} = \sigma(S_N)$, display (2.4.3) implies
$$\mathbb{P}(G \mid S_N) = 1 - \mathbb{P}(G^c \mid S_N) = 1 - S_N/N,$$
which proves the result.

Exercise 2.4.5. Using parts of the results of the previous Example 2.4.4, show that an i.i.d. sequence of random variables $(X_n)_{n\in\mathbb{N}}$ with $X_n \in L^1$ satisfies a strong law of large numbers, and show that this convergence holds in $L^1$ as well. In particular, note that this result is optimal in the sense that it requires the weakest possible integrability assumptions on $(X_n)$; in particular, it is stronger than Theorem 3.6.9 in [Dre19], where we had to assume $X_n \in L^4$.

2.5 A discrete Black-Scholes formula

The Black–Scholes formula for pricing options is a central result of probabilistic financial mathematics. In this section, we will, with the help of martingale theory, discuss a discrete version of this formula. In our (simple) economy, we will consider two types of securities (documents holding a certain worth): bonds, which deliver a fixed rate $r$ of return, and shares/stocks, whose value fluctuates. We will write $S_n$ for the value of a share during the time interval $(n, n+1)$ and $B_n = (1+r)^n B_0$


for the corresponding value of a bond. A portfolio $(A_n, V_n)$ at time $n$ consists of $A_n$ shares and $V_n$ bonds. With that, the initial capital of the investor is
$$x = X_0 := A_0 S_0 + V_0 B_0,$$
and through trading during the time interval $(0,1)$ the investor can obtain any portfolio $(A_1, V_1)$ which satisfies $x = A_1 S_0 + V_1 B_0$. The value of this portfolio in the interval $(1,2)$ is then
$$X_1 := A_1 S_1 + V_1 B_1,$$
and so on. The income in the $n$-th step is therefore
$$X_n - X_{n-1} = A_n(S_n - S_{n-1}) + V_n(B_n - B_{n-1}).$$
Since we can write $B_n - B_{n-1} = rB_{n-1}$ and $S_n - S_{n-1} = R_n S_{n-1}$ for a fluctuating rate $R_n$, we obtain
$$X_n - X_{n-1} = rX_{n-1} + A_n S_{n-1}(R_n - r).$$
It is often easier to consider the discounted value of a portfolio, defined as
$$Y_n := (1+r)^{-n} X_n,$$
which leads to
$$Y_n - Y_{n-1} = (1+r)^{-n} A_n S_{n-1}(R_n - r).$$
In our simplified model, we will only allow $R_n$ to take two possible values, specifically $a \in (-1, r)$ and $b \in (r, \infty)$.

Problem: A European option allows the holder of the option to buy a share at time $n = N$ for the price $K$. What is a fair price for such an option at time $n = 0$?

A hedging strategy with starting value $x$ for the option is a predictable process $((A_n, V_n) : n \in \mathbb{N})$ as described above, such that $X_n \ge 0$ for all $n \in \{0,\dots,N\}$ and such that
$$X_N = (S_N - K)^+.$$
Note that the left-hand side is precisely the value of the portfolio at time $N$, while the right-hand side is the value of the option at time $N$. Black and Scholes postulated that $x$ is a fair price for the option precisely when there exists a hedging strategy with starting value $x$, whose value at time $N$ matches the value of the option, independently of the evolution of the underlying stock price. Note that this statement does not make any assumption on the choice of the underlying model.

Theorem 2.5.1. A hedging strategy with starting value $x$ exists if and only if
$$x = \mathbb{E}\big[(1+r)^{-N}(S_N - K)^+\big],$$
where the expectation is with respect to i.i.d. $R_n$ that take value $b$ with probability $p = \frac{r-a}{b-a}$ and value $a$ with probability $1 - p = \frac{b-r}{b-a}$.

Proof. Assume that a hedging strategy with starting value $x$ exists. We consider a sequence of i.i.d. random variables $\varepsilon_1, \varepsilon_2, \dots$ with $\mathbb{P}(\varepsilon_i = 1) = p = 1 - \mathbb{P}(\varepsilon_i = -1)$ and set
$$Z_n = \sum_{k=1}^n (\varepsilon_k - 2p + 1).$$
With this choice $(Z_n)$ is a martingale and we can express $R_n$ from our model as
$$R_n = r + \frac12(b-a)(Z_n - Z_{n-1}).$$
Using this we can consider $(Y_n)$ as a stochastic integral, namely
$$Y_n = Y_0 + (F \bullet Z)_n,$$
where $F_n = \frac12(b-a)(1+r)^{-n} A_n S_{n-1}$. Importantly, such $(F_n)$ is predictable and bounded. Using Theorem 2.0.15 we obtain that $(Y_n)$ is therefore also a martingale. For $Y_0 = x$ it follows that
$$x = \mathbb{E}[Y_N] = \mathbb{E}\big[(1+r)^{-N} X_N\big] = \mathbb{E}\big[(1+r)^{-N}(S_N - K)^+\big],$$
as claimed. Note that the non-negativity of $X_n$ was not required.

We now show the converse – that for our choice of $x$ a hedging strategy exists. We therefore define
$$Y_n := \mathbb{E}\big[(1+r)^{-N}(S_N - K)^+ \,\big|\, \mathcal{F}_n\big].$$
With this definition $(Y_n)$ is a martingale, and it remains to show that a predictable process $(A_n)$ exists for which
$$Y_n - Y_{n-1} = (1+r)^{-n} A_n S_{n-1}(R_n - r).$$
Assuming this existence, we can then set $X_n = (1+r)^n Y_n$ and $V_n = (X_n - A_n S_n)/B_n$. This ensures that the equations that connect $X_n$ with $(A_n, V_n)$ hold and furthermore that $((A_n, V_n))$ is predictable. Since $X_0 = Y_0 = x$ and $X_N = \mathbb{E}[(S_N - K)^+ \mid \mathcal{F}_N] = (S_N - K)^+$, and since $X_n \ge 0$ as a conditional expectation of a non-negative random variable, we have found the desired hedging strategy. It remains to show the existence of $(A_n)$, which is implied by the following lemma.

Lemma 2.5.2. Let $(M_n : 0 \le n \le N)$ be a martingale with respect to the filtration $\mathcal{F}_n = \sigma(\varepsilon_1, \dots, \varepsilon_n)$ and let $Z_n = \sum_{k=1}^n (\varepsilon_k - 2p + 1)$. Then there exists a predictable process $(H_n)$ such that
$$M_n = M_0 + \sum_{k=1}^n H_k(Z_k - Z_{k-1}).$$

Proof. Since $M_n$ is measurable with respect to $\mathcal{F}_n$, there exists a function
$$f_n : \{-1,1\}^n \to \mathbb{R}$$
such that $M_n = f_n(\varepsilon_1, \dots, \varepsilon_n)$. The martingale property gives
$$0 = \mathbb{E}[M_n - M_{n-1} \mid \mathcal{F}_{n-1}] = p f_n(\varepsilon_1, \dots, \varepsilon_{n-1}, 1) + (1-p) f_n(\varepsilon_1, \dots, \varepsilon_{n-1}, -1) - f_{n-1}(\varepsilon_1, \dots, \varepsilon_{n-1}).$$
Setting
$$H_n = \frac{f_n(\varepsilon_1, \dots, \varepsilon_{n-1}, 1) - f_{n-1}(\varepsilon_1, \dots, \varepsilon_{n-1})}{2(1-p)} = \frac{f_{n-1}(\varepsilon_1, \dots, \varepsilon_{n-1}) - f_n(\varepsilon_1, \dots, \varepsilon_{n-1}, -1)}{2p},$$
the result follows by simple algebra.
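Since $S_N$ only depends on the number of up-moves, the fair price from Theorem 2.5.1 collapses from a sum over the $2^N$ paths to a binomial sum. A small sketch of this computation follows (not from the notes; the parameter values are arbitrary):

```python
from math import comb

def option_price(S0, K, r, a, b, N):
    # Risk-neutral price E[(1+r)^-N (S_N - K)^+], where
    # S_N = S0 * (1+b)^k * (1+a)^(N-k) with k up-moves,
    # and R_n = b with probability p = (r-a)/(b-a), else a.
    p = (r - a) / (b - a)
    price = 0.0
    for k in range(N + 1):
        SN = S0 * (1 + b)**k * (1 + a)**(N - k)
        price += comb(N, k) * p**k * (1 - p)**(N - k) * max(SN - K, 0.0)
    return price / (1 + r)**N

print(option_price(S0=100, K=100, r=0.01, a=-0.1, b=0.1, N=10))
```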

2.6 A discrete Ito-formula

Ito’s formula (also known as the Ito–Doeblin theorem in recognition of posthumously published work of Wolfgang Doeblin) is a useful tool, in particular in stochastic analysis and applications to mathematical finance. Since this will not be the subject of this course, we will restrict ourselves to giving a discrete version of it, which already provides some insight and intuition. Indeed, when $(M_n)$ is a martingale and we investigate the process $f(M_n)$ obtained by applying a nice function to the original process $(M_n)$, then Ito's lemma supplies us with the Doob decomposition of this process into a martingale (expressed as a stochastic integral) and a previsible process – this usually helps in obtaining a better understanding of the process.

From a different point of view, Ito's lemma can also be seen as a generalisation of the Fundamental Theorem of Calculus, which says that if $f : \mathbb{R} \to \mathbb{R}$ is continuously differentiable, then
$$f(x) = \int_0^x f'(s)\,\mathrm{d}s + f(0).$$
In particular, this tells us that we can write the function $f$ as the integral of an appropriate function, namely its derivative. In view of the discrete stochastic integral introduced in Definition 2.0.14, one might hope that we can also write
$$f(X_n) = (f' \bullet X)_n + C \qquad (2.6.1)$$
for some constant $C$ and a martingale $(X_n)$. As it turns out, this is not entirely true, but we will be able to recover a corresponding result.

In the following it suggests itself to either consider continuous-time processes, i.e., processes $(X_t)$, $t \in [0,\infty)$, or otherwise $\mathbb{Z}$-valued processes in discrete time, and the latter is what we will actually stick to for the remaining part of this section. In order to formulate a result corresponding to the Fundamental Theorem of Calculus we have to introduce the discrete derivative of a function $f : \mathbb{Z} \to \mathbb{R}$ in a standard way as
$$f'(x) = \frac{f(x+1) - f(x-1)}{2}, \qquad x \in \mathbb{Z},$$
as well as its second derivative (or Laplace operator)
$$f''(x) = f(x+1) - 2f(x) + f(x-1), \qquad x \in \mathbb{Z}.$$

Now let $(X_n)$ be a $\mathbb{Z}$-valued process with $X_{n+1} - X_n \in \{-1, 1\}$ for all $n \in \mathbb{N}_0$. Decomposing according to the possible increments we obtain that
$$f(X_{n+1}) - f(X_n) = \frac12\Big(\big(f(X_n+1) - f(X_n-1)\big)(X_{n+1} - X_n) + f(X_n+1) + f(X_n-1)\Big) - f(X_n) = f'(X_n)(X_{n+1} - X_n) + \frac12 f''(X_n),$$
and hence summation yields the discrete Ito formula
$$f(X_n) = \sum_{i=0}^{n-1} f'(X_i)(X_{i+1} - X_i) + \frac12 \sum_{i=0}^{n-1} f''(X_i) + f(X_0) = (f'(X_\cdot) \bullet X)_n + \frac12 \sum_{i=0}^{n-1} f''(X_i) + f(X_0). \qquad (2.6.2)$$
The first term on the right-hand side is the stochastic integral, but in addition to the terms expected in (2.6.1), we see that there is an additional term $\frac12 \sum_{i=0}^{n-1} f''(X_i)$, where the sum can also be interpreted as a (discrete) integral.

We observe that if $(X_n)$ is a martingale with $X_0 = c$ for some constant $c \in \mathbb{Z}$, then since the process $(f'(X_n))$ is adapted, due to Theorem 2.0.15 the process $((f'(X_\cdot) \bullet X)_n)$ is a martingale, too (observe here that $f'(X_n)$ is bounded due to $X_0 = c$ and $|X_{n+1} - X_n| = 1$). In addition, the process $(\frac12 \sum_{i=0}^{n-1} f''(X_i))$ is previsible. Hence, as a consequence of Theorem 2.0.17 we deduce that the Doob decomposition of the process $(f(X_n))$ is given by
$$f(X_n) = \Big(\sum_{i=0}^{n-1} f'(X_i)(X_{i+1} - X_i) + f(X_0)\Big) + \frac12 \sum_{i=0}^{n-1} f''(X_i),$$
with the term in brackets on the right-hand side being a martingale and the last summand being a previsible process. In particular, if $f$ is convex (in a discrete sense), then the second derivative of $f$ is non-negative and the above yields that $(f(X_n))$ is a submartingale due to Theorem 2.0.17 (which otherwise we had already concluded using Jensen's inequality in Theorem 2.0.11).
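Identity (2.6.2) is purely pathwise, so it can be checked mechanically. Below is a small sketch (not from the notes) verifying it for $f(x) = x^2$ along a simulated simple random walk; for this $f$ we have $f'(x) = 2x$ and $f'' \equiv 2$, so (2.6.2) reduces to $X_n^2 = \sum_i 2X_i(X_{i+1} - X_i) + n$:

```python
import random

def f(x):          # f(x) = x^2
    return x * x

def df(x):         # discrete derivative: (f(x+1) - f(x-1)) / 2 = 2x
    return (f(x + 1) - f(x - 1)) / 2

def d2f(x):        # discrete Laplacian: f(x+1) - 2 f(x) + f(x-1) = 2
    return f(x + 1) - 2 * f(x) + f(x - 1)

rng = random.Random(4)
X = [0]
for _ in range(1000):
    X.append(X[-1] + rng.choice((-1, 1)))

n = len(X) - 1
ito = sum(df(X[i]) * (X[i + 1] - X[i]) for i in range(n)) \
      + 0.5 * sum(d2f(X[i]) for i in range(n)) + f(X[0])
print(ito == f(X[n]))  # True: (2.6.2) holds path by path
```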


Chapter 3

Markov processes

See Section 1.17 of [Dre18] for a short introduction to Markov chains in discrete time. We will mostly focus on continuous-time Markov chains in this chapter. Although strictly not required for understanding the results of this chapter, it is useful and guiding for intuition to have a look at Section 1.17 of [Dre18] to obtain some basic knowledge of discrete-time Markov chains. For those of you who do know Markov chains in discrete time, continuous-time Markov chains can essentially be thought of as discrete-time Markov chains which are run through at random jump times (instead of the deterministic jump times in $\mathbb{N}$ in the case of discrete-time Markov chains). It will turn out that, in order to enjoy the crucial property that the transition probabilities of a Markov chain depend on the entire past solely through the state of the present, the times in between jumps will necessarily have to be distributed according to a sequence of i.i.d. exponentially distributed random variables. We start with the following generalisation of the Markov property for discrete-time stochastic processes, which had been treated in Section 1.17 of [Dre18].

Definition 3.0.1. Let $(X_t)$, $t \in T$, be a stochastic process taking values in a Polish space $(S, \mathcal{O})$ (endowed with the Borel $\sigma$-algebra) and let $(\mathcal{F}_t)$, $t \in T$, denote the filtration generated by the process $(X_t)$. Then $(X_t)$ has the Markov property if for any $0 < s < t$ with $s, t \in T$, and $B \in \mathcal{B}(S)$,
$$\mathbb{P}(X_t \in B \mid \mathcal{F}_s) = \mathbb{P}(X_t \in B \mid X_s). \qquad (3.0.1)$$

The Reader is invited to check that in the case of countable $S$ and $T = \mathbb{N}_0$, a stochastic process exhibits the Markov property with respect to its canonical filtration if and only if it is a Markov chain as introduced in [Dre18]. It can be shown that one of the properties equivalent to the Markov property is that ‘the past and the future are conditionally independent given the present.’

Exercise 3.0.2. (a) Random walk as introduced in Example 2.0.8 (a) has the Markov property (check!).

(b) Show that if $f : S \to \mathbb{R}$ is a measurable function and the stochastic process $(X_t)$ has the Markov property, then $f(X_t)$ does not necessarily have the Markov property.

(c) Recall the setting of Section 1.17 of [Dre18], where $S$ was a finite or countable set. A sequence of random variables $(X_n)$ on $(\Omega, \mathcal{F}, \mathbb{P})$ was called a homogeneous Markov chain with state space $S$ and transition matrix $P = (P(x,y))_{x,y\in S}$ if for all $n \in \mathbb{N}_0$ and $x_0, x_1, \dots, x_{n+1} \in S$ one has
$$P(x_n, x_{n+1}) = \mathbb{P}(X_{n+1} = x_{n+1} \mid X_n = x_n) = \mathbb{P}(X_{n+1} = x_{n+1} \mid X_0 = x_0, X_1 = x_1, \dots, X_n = x_n), \qquad (3.0.2)$$
whenever $\mathbb{P}(X_0 = x_0, X_1 = x_1, \dots, X_n = x_n) > 0$.

Show that $(X_n)$ has the Markov property.


Exercise 3.0.3. Let $T = \mathbb{N}_0$ and $S = \mathbb{R}$. In this context, is every martingale a Markov process, and vice versa? Find counterexamples or proofs!

Remark 3.0.4. For $T = \mathbb{N}_0$, every stochastic process can be turned into a Markov chain (here: a stochastic process exhibiting the Markov property) by considering the state space $\bigcup_{n\in\mathbb{N}} S^n$ instead, and by considering the process $(\widehat X_n)$, $n \in \mathbb{N}_0$, defined via
$$\widehat X_n := (X_0, \dots, X_n).$$
Intuitively, the above display (3.0.1) tells us that knowing all the past of the process $(X_t)$ up to time $s$ does not provide us with any advantage over only knowing $X_s$. Also, equation (3.0.1) naturally leads to the concept of regular conditional probabilities. Indeed, the factorisation lemma implies that for each $B \in \mathcal{B}(S)$ there exists a $\mathcal{B}(S)$-$\mathcal{B}$-measurable function $\varphi_B : S \to \mathbb{R}$ such that
$$\mathbb{P}(X_t \in B \mid \mathcal{F}_s) = \varphi_B(X_s),$$
where, as always, this equality is to be understood in a $\mathbb{P}$-a.s. sense. It would be nice, however, to have that this identity is ‘well-behaved’ in $B$ as well, in the sense that the right-hand side, for $X_s$ fixed, defines a probability measure in $B$. This is the reason for us to insert the following section here.

3.1 Interlude: Regular conditional distributions

Recall from Chapter 1 that we defined the conditional probability of $B \in \mathcal{F}$ given a sub-$\sigma$-algebra $\mathcal{G}$ as $\mathbb{P}(B \mid \mathcal{G})(\omega) = \mathbb{E}[\mathbf{1}_B \mid \mathcal{G}](\omega)$ for almost all $\omega \in \Omega$. Therefore, for almost all $\omega$, $\mathbb{P}(\cdot \mid \mathcal{G})(\omega)$ satisfies countable additivity due to Theorem 1.0.9 (a) and (b). However, the set of $\omega \in \Omega$ of measure $0$ where countable additivity fails might depend on the sequence of sets in question, and there might be uncountably many such sequences, so that their union covers all of $\Omega$. To avoid this problem, we must choose the conditional probabilities carefully, which we do in this section.

Definition 3.1.1. Let (Ω1,F1) and (Ω2,F2) be measurable spaces. A mapping

µ : Ω1 ×F2 → [0, 1]

is called a transition kernel or stochastic kernel (from (Ω1, F1) to (Ω2, F2)) if the following properties hold true:

(a) For each F ∈ F2, the mapping

Ω1 ∋ ω ↦ µ(ω, F)

is F1 − B-measurable;

(b) For each ω ∈ Ω1, the mapping

F2 ∋ F ↦ µ(ω, F)

defines a probability measure on F2.

Example 3.1.2. If (S, 2^S) is the countable state space (endowed with its power set) of a (time homogeneous) Markov chain with transition matrix P, then we can define the transition kernel

p : S × 2^S ∋ (x, B) ↦ Σ_{y∈B} P(x, y)

from (S, 2^S) to itself. In this sense, transition kernels can be considered generalisations of transition matrices for Markov chains. As we will see in Theorem 3.1.19 below, however, transition kernels can be used to construct more general stochastic processes than just Markov processes, by not only taking elements from S as an argument (corresponding to the state of a Markov chain at time n − 1), but by instead considering the n-fold product space S^{{0,…,n−1}} (corresponding to the entire history of a Markov chain up to time n − 1). This will then be a kernel from (S^{{0,…,n−1}}, 2^{S^{{0,…,n−1}}}) to (S, 2^S).

Definition 3.1.3. Let (Ω, F, P) be a probability space, (E, E) a measurable space, and X : (Ω, F) → (E, E) a random variable, as well as G a sub-σ-algebra of F. A function µ : Ω × E → [0, 1] is called a regular conditional distribution for X given G if

(a) for each F ∈ E, the function Ω ∋ ω ↦ µ(ω, F) is a version¹ of the conditional probability P(X ∈ F | G) (in particular, µ(·, F) is a G − B(R)-measurable function);

(b) for P-almost all ω ∈ Ω,

E ∋ F ↦ µ(ω, F)

is a probability measure on E .

In other words, a regular conditional distribution for X given G is a stochastic kernel µ such that for P-almost all ω ∈ Ω and all F ∈ E,

µ(ω, F) = P(X ∈ F | G)(ω).    (3.1.1)

In the case (E, E) = (Ω, F) and X the identity map, µ is also called a regular conditional probability.

Indeed, note that for any fixed F ∈ E, we can just define the stochastic kernel µ via (3.1.1). However, it is far from clear that there exists a set Ω̃ ∈ F of probability 1 such that for all ω ∈ Ω̃ and all F ∈ E, display (3.1.1) holds true. Indeed, there are settings where a regular conditional probability distribution does not exist – see [Fad85] for conditions on the existence of regular conditional probability distributions.

Theorem 3.1.4. Let (S, O) be a Polish space and let X : (Ω, F, P) → (S, O) be a random variable. Furthermore, let G ⊂ F be a sub-σ-algebra.

Then a regular conditional distribution µ of X given G exists. Furthermore, it is unique in the sense that if µ′ is another regular conditional distribution of X given G, then for P-almost all ω, the measures µ(ω, ·) and µ′(ω, ·) coincide.

As a consequence, we call any such regular conditional distribution the regular conditional distribution.

In order to prove this result, we introduce the following notion: For A ⊂ C two algebras of subsets of some Polish space (S, O) and a finite, additive, non-negative function µ defined on C, we say that µ is regular on A for C if for each A ∈ A we have

µ(A) = sup{µ(C) : C ∈ C with C ⊂ A and C compact}.

We will also take advantage of the following auxiliary result.

Lemma 3.1.5. If in the above context µ is regular on A for C, then µ is σ-additive on A.

Proof. Assume that µ is not σ-additive. Then by Proposition 1.2.15 of [Dre19] we infer that µ cannot be continuous in ∅ either, so there exist δ > 0 and a sequence of sets (An) with An ∈ A

¹ I.e., we have P({ω ∈ Ω : µ(ω, F) ≠ P(X ∈ F | G)(ω)}) = 0. In particular, recall that the random variable P(X ∈ F | G) is only uniquely defined up to a P-null set, i.e. as an element of L¹, not as a pointwise defined function. So the previous can be formulated as µ(·, F) = P(X ∈ F | G) in L¹.


and An ↓ ∅, as well as µ(An) > δ for all n ∈ N. Since µ is regular on A for C, for each n ∈ N there is a compact Kn ⊂ An with Kn ∈ C and µ(An \ Kn) ≤ δ2^{−n−1}. This implies

µ(⋂_{i=1}^n Ki) ≥ µ(An) − Σ_{i=1}^n δ2^{−i−1} ≥ δ/2.

In particular, all the ⋂_{i=1}^n Ki are non-empty, and

⋂_{i=1}^∞ Ki ≠ ∅    (3.1.2)

as well. Indeed, if this intersection were empty, then the sets (⋂_{i=1}^n Ki)^c, n ∈ N, would form an open covering of K1; since K1 is compact, there would exist a finite subcover, which would imply that one of the ⋂_{i=1}^n Ki, n ∈ N, would have to be disjoint from K1 – a contradiction, so (3.1.2) holds true. But (3.1.2) contradicts An ↓ ∅ (since Kn ⊂ An). Hence, µ must be σ-additive.

Proof of Theorem 3.1.4. We start with observing that since (S, O) is Polish, B(S) is countably generated, e.g. by the set U := {B(x, 1/n) : x ∈ S̃, n ∈ N}, where S̃ ⊂ S is any countable dense subset of S and B(x, 1/n) is the open ball around x of radius 1/n with respect to any (but a fixed one) of the metrics which induce the topology O. Therefore, the algebra V generated by U can be written as a countable union of non-decreasing, finite algebras Vn. By the regularity Lemma 4.2.3 from [Dre19] (Ulam's theorem), for each B ∈ V there exists a sequence (B^B_j) with B^B_j ⊂ B, B^B_j compact, such that

P_X(B^B_j) ↑ P_X(B) as j → ∞.

Since finite unions of compact sets remain compact, we can choose a specific sequence (B^B_j) for B such that B^B_1 ⊂ B^B_2 ⊂ ⋯, and hence by the conditional monotone convergence theorem (Theorem 1.0.9 (d)) we deduce that

P_X(B^B_j | G) ↑ P_X(B | G)   P-a.s.    (3.1.3)

But now the union of V with all the possible B^B_j, B ∈ V, j ∈ N, is countable again, and we denote the countable algebra generated by this union by A. For a given choice of P_X(A | G)(ω) for all A ∈ A and all ω ∈ Ω, which is G − B-measurable in ω, we have:

(a) For each A ∈ A,

PX(A | G)(ω) ≥ 0 almost surely.

(b)

PX(S | G)(ω) = 1 a.s., and PX(∅ | G)(ω) = 0 a.s.

(c) For any finite sequence of pairwise disjoint sets A1, …, An with Aj ∈ A for all j ∈ {1, …, n}, and almost all ω, we have

P_X(⋃_{j=1}^n Aj | G)(ω) = Σ_{j=1}^n P_X(Aj | G)(ω);

(d) For almost all ω and all sequences (B^B_j), j ∈ N, B ∈ V, the convergence in (3.1.3) holds true.


Items (a) to (d) consist of countably many equations or inequalities in terms of G − B-measurable functions, including limits of sequences of such functions, each holding almost surely. Therefore, there exists some Ω̃ ∈ G with P(Ω̃) = 1 on which we can replace the a.s. statements in (a) to (d) by "for all ω ∈ Ω̃".

Now by definition of Ω̃, for ω ∈ Ω̃, the function P_X(· | G)(ω) is regular on V for A, and hence is σ-additive on the algebra V by Lemma 3.1.5. By Corollary 1.3.9 from [Dre19], we can extend this function uniquely to a probability measure µ(ω, ·) on B(S).

From this we construct the regular conditional distribution as follows. Denote by D the set of all B ∈ B(S) such that

ω ↦ µ(ω, B)

defines a version of the conditional probability P_X(B | G). Then due to the monotone convergence theorem for conditional expectations, D is a Dynkin system, and furthermore it contains V. Hence, due to Dynkin's π-λ-theorem (see Theorem 1.1.30 in [Dre19]), D = B(S), and therefore µ as above, extended by setting µ(ω, ·) := 0 on Ω \ Ω̃, defines a regular conditional distribution.

As regards the uniqueness property, observe that if µ′ is another regular conditional distribution of X given G, then P-a.s. the two functions coincide on the countable algebra A. But since A is a π-system generating B(S), Dynkin's π-λ-theorem again yields that P-a.s.,

µ(·, B) = µ′(·, B) ∀B ∈ B(S),

which proves the desired uniqueness.

Remark 3.1.6. In fact, the previous result not only holds for Polish spaces (S, O); we can even replace (S, O) by a standard Borel space. A measurable space (E, E) is called a (standard) Borel space if there exists a bijection ϕ : E → B for some B ∈ B(R), such that ϕ and ϕ⁻¹ are both measurable. See e.g. Theorem 4.1.6 in [Dur10] or Theorem 8.37 in [Kle14]. Indeed, we can also deduce this without much effort from our general result for Polish spaces above, since in particular R is a Polish space.

Example 3.1.7. (a) Let X, Y be real random variables such that the random vector (X, Y) has probability density f with respect to the two-dimensional Lebesgue measure λ² on B².

Then the marginal density of X (with respect to the one-dimensional Lebesgue measure λ) is given by

f_X(x) := ∫_R f(x, y) dy,

and for each x ∈ R with f_X(x) ∈ (0, ∞) and y ∈ R we define the conditional density of Y given X = x via

f_{Y|X}(y | x) := f(x, y)/f_X(x).

Defining µ(x, ·) as the probability measure on B with density f_{Y|X}(· | x) if f_X(x) ∈ (0, ∞), and as an arbitrary but fixed probability measure on B otherwise, we obtain a regular conditional distribution of Y given X.

(b) We can now also make precise our motivating Example 1.0.2 from the very beginning. There we considered the situation of two independent Gaussian random variables X and Y distributed according to N(0, 1), and setting Z := X + Y we wanted to make sense of

P(Z ∈ · |X = x). (3.1.4)

Take Ω := R², F := B², as well as X(x, y) := x, Y(x, y) := y, and

P := (1/2π) e^{−(x²+y²)/2} · λ².


In this context, we interpret (3.1.4) as the regular conditional distribution µ of Z given X evaluated at (x, y), y ∈ R arbitrary (i.e., µ((x, y), ·)), and we claim that it is given by

µ((x, y), B) := (1/√(2π)) ∫_B e^{−(z−x)²/2} dz,   B ∈ B, (x, y) ∈ R².

Indeed, for each (x, y) ∈ R² we have that µ((x, y), ·) defines a probability measure on B, so Part (b) of Definition 3.1.3 holds true.

On the other hand, for each B ∈ B we have that (x, y) ↦ µ((x, y), B) is σ(X) − B-measurable (recall that σ(X) = B ⊗ {R, ∅}, and by Tonelli's theorem, 'integrating out one of the variables of a measurable function keeps measurability of the resulting integral in the other variables'), and furthermore, for A ∈ σ(X) we have A = A1 × R for some A1 ∈ B, so

∫_A µ((x, y), B) P(dx, dy) = (1/2π) ∫_A (1/√(2π)) (∫_B e^{−(z−x)²/2} dz) e^{−(x²+y²)/2} λ²(dx, dy)
    = (1/2π) ∫_{A1×B} e^{−((z−x)²+x²)/2} λ²(dx, dz).    (3.1.5)

On the other hand,

P({Z ∈ B} ∩ A) = (1/2π) ∫ 1_{x+y∈B} 1_{x∈A1} e^{−(x²+y²)/2} λ(dy) λ(dx)
    = (1/2π) ∫_{A1×B} e^{−(x²+(y−x)²)/2} λ²(dx, dy)   [substituting y ↦ y − x],

so it equals (3.1.5), and hence Part (a) of Definition 3.1.3 holds true as well.

Theorem 3.1.8 (Disintegration formula). Let (E, E) be a measurable space, and let X : (Ω, F, P) → (E, E) be a random variable. Furthermore, let f : (E, E) → (R, B) be measurable with f ∘ X ∈ L¹, and let G be a sub-σ-algebra of F. Then f is integrable with respect to µ(ω, ·) for P-a.a. ω ∈ Ω, and

E[f ∘ X | G](ω) = ∫_E f(x) µ(ω, dx)   for P-almost all ω ∈ Ω,    (3.1.6)

where µ denotes the regular conditional distribution of X given G.

Proof. Due to the linearity of (3.1.6) in f, and since f = f⁺ − f⁻, we can assume w.l.o.g. that f ≥ 0.

If f = 1_B for some B ∈ E, then by the definitions, for P-almost all ω ∈ Ω we have

E[f ∘ X | G](ω) = P(X ∈ B | G)(ω)

and

∫ f(x) µ(ω, dx) = µ(ω, B) = P(X ∈ B | G)(ω),

so the claim holds true, and due to linearity it also extends to simple functions. Now choose a sequence (fn) of simple functions with fn ≥ 0 and fn ↑ f. The monotone convergence theorem for conditional expectations (Theorem 1.0.9(b)) implies that

E[fn ∘ X | G] ↑ E[f ∘ X | G]

P-almost surely. On the other hand, for any fixed ω we have that

∫ fn(x) µ(ω, dx) ↑ ∫ f(x) µ(ω, dx)

due to the standard monotone convergence theorem, and the result follows.


An application of the previous result is given by the following exercise.

Exercise 3.1.9. Regular conditional probability distributions can be used to derive a conditional Hölder inequality: For X, Y real-valued random variables, G ⊂ F a sub-σ-algebra, and p, q ∈ (1, ∞) with p⁻¹ + q⁻¹ = 1, we have

E[|XY| | G] ≤ E[|X|^p | G]^{1/p} E[|Y|^q | G]^{1/q}.

3.1.1 Properties of transition kernels

While we have seen that Kolmogorov's existence and uniqueness result is quite helpful in constructing stochastic processes once we have a consistent family of probability measures, it turns out that oftentimes (and in particular in the setting of Markov processes) a more hands-on approach is more useful: instead of starting with a consistent family of probability measures, it might be easier to start from a sequence of transition kernels.

We start with the following basic lemma.

Lemma 3.1.10. Let µ be a transition kernel from (E1, E1) to (E2, E2), and let f : (E1 × E2, E1 ⊗ E2) → ([0, ∞], B([0, ∞])) be a measurable function. Then the mapping

ϕ_f : E1 ∋ x1 ↦ ∫ f(x1, x2) µ(x1, dx2) ∈ [0, ∞]

is well-defined and E1 − B([0, ∞])-measurable.

Proof. Obviously the mapping is well-defined (since x2 ↦ f(x1, x2) is measurable for any x1 ∈ E1 due to a pre-result to Fubini's / Tonelli's theorem).

As almost always, the first step is to show the statement for f = 1_F with F ∈ E1 ⊗ E2. For this purpose, consider F = F1 × F2 with Fi ∈ Ei, i ∈ {1, 2}. Then

∫ 1_{F1×F2}(x1, x2) µ(x1, dx2) = 1_{F1}(x1) µ(x1, F2),

which evidently is an E1 − B([0, ∞])-measurable function. We now claim that the set of F ∈ E1 ⊗ E2 for which ϕ_{1_F} is E1 − B([0, ∞])-measurable forms a Dynkin system D. To start off with, we certainly have E1 × E2 ∈ D, since ϕ_{1_{E1×E2}}(x1) = µ(x1, E2) ∈ [0, 1], which is measurable by assumption. Furthermore, for D ∈ D we have

ϕ_{1_{D^c}}(x1) = ϕ_{1_{E1×E2}}(x1) − ϕ_{1_D}(x1),

which in combination with the previous entails D^c ∈ D. It remains to show the stability of D under countable disjoint unions. For this purpose let (Dn), n ∈ N, be a sequence of pairwise disjoint sets with Dn ∈ D for all n ∈ N. We then infer

ϕ_{1_{⋃_{n∈N} Dn}}(x1) = ∫ 1_{⋃_{n∈N} Dn}(x1, x2) µ(x1, dx2) = Σ_{n∈N} ∫ 1_{Dn}(x1, x2) µ(x1, dx2),

where the last equality follows from the monotone convergence theorem. Since each summand on the right-hand side is E1 − B([0, ∞])-measurable, we obtain that the sum is E1 − B([0, ∞])-measurable again as a limit of E1 − B([0, ∞])-measurable functions.

All in all, the above shows that D is a Dynkin system containing the π-system of rectangles, which generates E1 ⊗ E2. Hence, Dynkin's π-λ-theorem implies D = E1 ⊗ E2.

This shows that for any F ∈ E1 ⊗ E2, the function ϕ_{1_F} defines an E1 − B([0, ∞])-measurable function, and by linearity this extends to simple functions. Now any arbitrary measurable function f : (E1 × E2, E1 ⊗ E2) → ([0, ∞], B([0, ∞])) can be written as the monotone pointwise limit fn ↑ f of simple functions fn. Then, again by monotone convergence,

ϕ_f(x1) = lim_{n→∞} ϕ_{fn}(x1);

each of the functions on the right-hand side is measurable in x1 and hence so is the limit on the left-hand side, which finishes the proof.


Theorem 3.1.11. Let (Ei, Ei), i ∈ {0, 1, 2}, be measurable spaces, let µ1 be a transition kernel from (E0, E0) to (E1, E1), and let µ2 be a transition kernel from (E0 × E1, E0 ⊗ E1) to (E2, E2). We define the mapping

µ1 ⊗ µ2 : E0 × (E1 ⊗ E2) → [0, 1]

via

(x0, F) ↦ ∫_{E1} (∫_{E2} 1_F(x1, x2) µ2((x0, x1), dx2)) µ1(x0, dx1).    (3.1.7)

Then µ1 ⊗ µ2 is well-defined and a transition kernel from (E0, E0) to (E1 × E2, E1 ⊗ E2). It is denoted by µ1 ⊗ µ2 and also referred to as the product of µ1 and µ2.

We get the following immediate corollary. It is tailor-made for the setting where we have an initial distribution of the stochastic process to construct.

Corollary 3.1.12. Let (Ei, Ei), i ∈ {1, 2}, be measurable spaces, let P1 be a probability measure on (E1, E1), and let µ2 be a transition kernel from (E1, E1) to (E2, E2). We define the mapping

P1 ⊗ µ2 : E1 ⊗ E2 → [0, 1]

via

F ↦ ∫_{E1} (∫_{E2} 1_F(x1, x2) µ2(x1, dx2)) P1(dx1).    (3.1.8)

Then P1 ⊗ µ2 is well-defined and a probability measure on (E1 × E2, E1 ⊗ E2). It is denoted by P1 ⊗ µ2.

Proof. This follows directly by interpreting P1 as a transition kernel from (E0, E0) to (E1, E1) which is constant in the E0-coordinate.

Also, inductively, Theorem 3.1.11 in combination with the proof of the previous corollary supplies us with the following result.

Corollary 3.1.13. Let (Ei, Ei), i ∈ {0, 1, …, n}, be measurable spaces, and for i ∈ {1, …, n} let µi be a transition kernel from (×_{j=0}^{i−1} Ej, ⊗_{j=0}^{i−1} Ej) to (Ei, Ei). We can define the mapping

⊗_{i=1}^n µi : E0 × (⊗_{i=1}^n Ei) → [0, 1]

as

(⊗_{i=1}^{n−1} µi) ⊗ µn.    (3.1.9)

Then ⊗_{i=1}^n µi is well-defined and a transition kernel from (E0, E0) to (×_{i=1}^n Ei, ⊗_{i=1}^n Ei).

Also, as above, we can replace the kernel µ1 by a probability measure P1 on (E1, E1) to obtain that P1 ⊗ ⊗_{i=2}^n µi is a probability measure on (×_{i=1}^n Ei, ⊗_{i=1}^n Ei).

Remark 3.1.14. In particular in the context of transition kernels, it turns out that it is sometimes more convenient and easier to read if we write the integrating kernel directly after the integral sign (a notation which likely originates in physics). This usually provides a better overview of which integration is performed at which point. For example, the iterated integral on the right-hand side of (3.1.7) would be written as

∫_{E1} µ1(x0, dx1) (∫_{E2} µ2((x0, x1), dx2) 1_F(x1, x2)).


Proof of Theorem 3.1.11. We start with observing that

(E0 × E1, E0 ⊗ E1) ∋ (x0, x1) ↦ ∫_{E2} 1_F(x1, x2) µ2((x0, x1), dx2) ∈ ([0, ∞], B([0, ∞]))

defines a measurable mapping due to Lemma 3.1.10, and applying that lemma once again we infer that (µ1 ⊗ µ2)(x0, F) is well-defined and E0 − B-measurable as a function in x0; this shows Part (a) of Definition 3.1.1. Furthermore, by monotone convergence we deduce that for any x0 ∈ E0,

E1 ⊗ E2 ∋ F ↦ ∫_{E1} (∫_{E2} 1_F(x1, x2) µ2((x0, x1), dx2)) µ1(x0, dx1)

is σ-additive. Since it is furthermore non-negative and evaluates to 1 for E1 × E2, we infer that Part (b) of Definition 3.1.1 is fulfilled. All in all, we conclude that µ1 ⊗ µ2 defines a transition kernel.

The preceding theorem can be generalised to n-fold products of correspondingly defined kernels in an obvious way.

Remark 3.1.15. Note that in the special case where µ1 in the above theorem is constant in x0 (so µ1 is just a probability measure on E1), we infer that µ1 ⊗ µ2 is a probability measure on E1 ⊗ E2; in particular, we can therefore define the product of a probability measure and a transition kernel, which yields a probability measure again.

Furthermore, oftentimes the composition of kernels is of interest as well.

Definition 3.1.16. For measurable spaces (Ei, Ei), i ∈ {0, 1, 2}, and transition kernels µi from (E_{i−1}, E_{i−1}) to (Ei, Ei), i ∈ {1, 2}, we define the composition ('Verkettung') µ1µ2 of µ1 and µ2 via

µ1µ2 : E0 × E2 ∋ (x0, F2) ↦ ∫_{E1} µ1(x0, dx1) µ2(x1, F2).

The relation between the composition and the product of kernels is given by the following.

Theorem 3.1.17. For measurable spaces (Ei, Ei), i ∈ {0, 1, 2}, and transition kernels µi from (E_{i−1}, E_{i−1}) to (Ei, Ei), i ∈ {1, 2}, we have

µ1µ2(x0, F2) = (µ1 ⊗ µ2)(x0, π2⁻¹(F2))   ∀x0 ∈ E0, F2 ∈ E2,    (3.1.10)

where π2 : E1 × E2 → E2 denotes the projection π2 : (x1, x2) ↦ x2, and µ1µ2 defines a transition kernel from (E0, E0) to (E2, E2). I.e., for each x0 ∈ E0, we have that

µ1µ2(x0, ·) = (µ1 ⊗ µ2)(x0, ·) ∘ π2⁻¹,    (3.1.11)

so µ1µ2(x0, ·) is the image measure of (µ1 ⊗ µ2)(x0, ·) under π2.

Remark 3.1.18. Here, as is commonly done, we interpret the transition kernel µ2 as a transition kernel from (E0 × E1, E0 ⊗ E1) to (E2, E2) in order to construct µ1 ⊗ µ2.

Proof. For all x0 ∈ E0 and F2 ∈ E2, we have

(µ1 ⊗ µ2)(x0, π2⁻¹(F2)) = ∫_{E1} µ1(x0, dx1) ∫_{E2} 1_{E1×F2}(x1, x2) µ2(x1, dx2)
    = ∫_{E1} µ1(x0, dx1) µ2(x1, F2) = µ1µ2(x0, F2),

which establishes (3.1.10).

Since π2⁻¹(F2) ∈ E1 ⊗ E2, Theorem 3.1.11 provides us with the desired measurability of µ1µ2(·, F2) for F2 ∈ E2. Furthermore, from (3.1.11) we infer that for each x0 ∈ E0, µ1µ2(x0, ·) is indeed a probability measure.


Arguably the most intuitive and most accessible version of Ionescu-Tulcea's theorem is formulated in the product space setting. For this purpose, assume the following: Let (En, En), n ∈ N0, be a sequence of measurable spaces and let P0 be a probability measure on (E0, E0). For n ∈ N, set Ωn := ×_{i=0}^n Ei and Ω := ×_{i=0}^∞ Ei, and define the σ-algebras Fn := ⊗_{i=0}^n Ei as well as F := ⊗_{i=0}^∞ Ei on these spaces. Furthermore, let (µn), n ∈ N, be a sequence of transition kernels such that µn is a transition kernel from (Ω_{n−1}, F_{n−1}) to (En, En). Using Corollary 3.1.13 we observe that

Pn := P0 ⊗ ⊗_{i=1}^n µi

defines a probability measure on Fn.

Furthermore, it is not hard to see inductively that the sequence (Pn), n ∈ N, fulfills the following consistency condition: For m, n ∈ N with m < n we have

Pm(Fm) = Pn(Fm × ×_{i=m+1}^n Ei)   ∀Fm ∈ Fm.    (3.1.12)

Therefore, in a way similar to the context of Kolmogorov's existence and uniqueness theorem in [Dre19] (where we also had an underlying consistency condition), we can and do now enquire about the existence (and uniqueness) of a probability measure P on (Ω, F) such that

Pm(F) = P(F × ×_{i=m+1}^∞ Ei)   ∀m ∈ N, F ∈ Fm.    (3.1.13)

Theorem 3.1.19 (Ionescu-Tulcea). In the setting described above, there exists a unique probability measure P on (Ω, F) such that (3.1.13) holds true.

Proof. If the (Ei, Ei) were Polish spaces, taking advantage of the consistency condition (3.1.12) we could verify the assumptions of Kolmogorov's existence and uniqueness theorem in order to deduce the existence of the probability measure that we are looking for here. Indeed, to check the assumptions of Theorem 4.2.1 in [Dre19] in this setting, set Λ := N0, and take Eλ as in the assumptions with their Borel σ-algebras Bλ. For J ⊂ Λ finite, we can write J = {λ1, …, λn} with λ1 < … < λn and define

P_J(B_J) := P_{λn}({x ∈ ×_{i=0}^{λn} Ei : (x_{λ1}, …, x_{λn}) ∈ B_J}),   ∀B_J ∈ ⊗_{λ∈J} Bλ,

i.e.,

P_J = P_{λn} ∘ π⁻¹_{λ1,…,λn},

where π_{λ1,…,λn} is the coordinate projection from ×_{i=0}^{λn} Ei to ×_{λ∈J} Eλ, and so P_J defines a probability measure on (×_{λ∈J} Eλ, ⊗_{λ∈J} Bλ). Then one can check that the family P_J, J ⊂ Λ, is consistent, and we can apply Theorem 4.2.1 of [Dre19] in order to deduce the existence of a probability measure with the desired properties. For the general case of measurable spaces (Ei, Ei) we refer to e.g. [Kle14] for a proof.

3.2 Back to Markov processes again

In the previous section the product of transition kernels was defined in a rather general way in Theorem 3.1.11. We will mostly be interested in the simpler case where (Ei, Ei) does not depend on i (so we can write (E, E) instead) and the transition kernel µn depends on ×_{i=0}^{n−1} E only through its last coordinate – this will be crucial for obtaining Markov processes, which we will treat for possibly uncountable index sets T at once.

In particular, having the above basic theory of transition kernels and induced distributions at our disposal, we can introduce the following definition.


Definition 3.2.1. As before, let T ⊂ R be a semigroup and assume a measurable space (E, E) to be given, as well as a family (µt), t ∈ T, of transition kernels from (E, E) to (E, E).² Then the family (µt), t ∈ T, is called a Markov semigroup (of transition kernels) if the equation

µs µt = µ_{s+t}    (3.2.1)

holds true for all s, t ∈ T.

Equation (3.2.1) is also called the Chapman-Kolmogorov equation (cf. Proposition 1.17.32 in [Dre18]).

The name semigroup is motivated by the fact that the kernels (µt), t ∈ T, form a semigroup in the algebraic sense with respect to composition.

In the case T = N and countable state space (E, 2^E), the Markov transition kernels µn give exactly the probability that a Markov process (or Markov chain in this setting) jumps from state x ∈ E to state y ∈ E in n time steps, namely µn(x, {y}), and as is further detailed in [Dre18], the law of the entire Markov chain is then uniquely characterised (modulo the initial distribution) by the transition matrix P(x, y), x, y ∈ E, whose entries give the probability that the Markov chain jumps from x to y in one time step, so

P(x, y) = µ1(x, {y}).    (3.2.2)

Display (3.2.1) then corresponds to the fact that for m, n ∈ N we have P^m P^n = P^{m+n} for the matrix powers of P.
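For a finite-state chain this identity can be checked in one line; a sketch with a hypothetical 3-state transition matrix:

import numpy as np

P = np.array([[0.7, 0.2, 0.1],      # hypothetical transition matrix
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])

mu = {n: np.linalg.matrix_power(P, n) for n in range(7)}  # mu_n = P^n

# Chapman-Kolmogorov (3.2.1): mu_m mu_n = mu_{m+n}, i.e. P^m P^n = P^{m+n}.
for m in range(4):
    for n in range(4):
        assert np.allclose(mu[m] @ mu[n], mu[m + n])
print("Chapman-Kolmogorov holds for the matrix semigroup P^n")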

Example 3.2.2. In the case (E, E) = (R^d, B^{⊗d}), a class of Markov semigroups is given by those which have a density with respect to Lebesgue measure.

(a) A particularly important semigroup is the Brownian semigroup of transition kernels, which is named after the botanist Robert Brown and will give rise to Brownian motion in a later chapter. We choose T = [0, ∞) and define the transition kernels

µt(x, dy) := (2πt)^{−d/2} exp(−(y − x)·(y − x)/(2t)) λ^d(dy),   t > 0,

and µ0(x, ·) := δx. Hence, µt(x, ·) is the law of an N(x, t·Id_d)-distributed random variable, with the common convention that for t = 0 this equals δx.

We first have to check that this defines a semigroup of transition kernels. For this purpose, observe that for s, t > 0 (the case s = 0 can be checked separately) as well as x0 ∈ R^d and B ∈ B^{⊗d} we have

(µs µt)(x0, B) = ∫_{R^d} µs(x0, dx1) µt(x1, B) = ∫_{R^d} µs(x0, dx1) µt(0, B − x1)
    = (µs(x0, ·) ∗ µt(0, ·))(B),

where we took advantage of the translation invariance of the semigroup (µt) in the sense that for any x, y ∈ R^d and B ∈ B^{⊗d} we have µt(x + y, B) = µt(x, B − y).

Now due to Corollary 3.8.8 from [Dre19] we infer that the convolution on the right-hand side is an N(x0, (s+t)·Id_d) distribution, so it equals µ_{s+t}(x0, ·), which shows the validity of (3.2.1).

In particular, we obtain the existence of a random walk (Xn), n ∈ N0, with Gaussian increments via Theorem 3.1.19. However, this is no big news, since we knew it before (e.g. through Kolmogorov's existence and uniqueness theorem).

² In this case we will also say that µt is a transition kernel on (E, E).


(b) For T = [0, ∞) define the Markov semigroup (µt), t ∈ T, of transition kernels on (N0, 2^{N0}) as follows: for ϱ ∈ (0, ∞) fixed, denote by νt the law of a Poi(ϱt)-distributed random variable (with ν0 = δ0), and define

µt(x, B) := νt(B − x),   ∀x ∈ N0, B ∈ 2^{N0}.

Then (µt), t ∈ T, indeed defines a Markov semigroup on (N0, 2^{N0}): for t ∈ T, µt is a mapping from N0 × 2^{N0} to [0, 1], and for each F ∈ 2^{N0}, the mapping

N0 ∋ x ↦ µt(x, F)

is certainly 2^{N0} − B-measurable. Furthermore, for each x ∈ N0, the mapping

2^{N0} ∋ F ↦ µt(x, F)

describes a probability measure by definition. It remains to show the Chapman-Kolmogorov equation (3.2.1). For this purpose, recall that by Corollary 3.8.8 (b) of [Dre19], the convolution of a Poi(ϱ1 t) with a Poi(ϱ2 t) distribution results in a Poi((ϱ1 + ϱ2)t) distribution. Hence, we infer for s, t ∈ T, x ∈ N0, and B ∈ 2^{N0} that

µs µt(x, B) = ∫_{N0} µs(x, dy) µt(y, B) = ∫_{N0} µs(x, dy) µt(0, B − y) = (νs(· − x) ∗ νt)(B)
    = (νs ∗ νt)(B − x) = ν_{s+t}(B − x) = µ_{s+t}(x, B),

and we're done. This semigroup is also referred to as the Poisson semigroup (with parameter ϱ).

In both of the above examples it was crucial that we were able to write the product of Markov kernels as a convolution of probability measures. In fact, there is a general theory of (translation invariant) convolution semigroups making this more precise, but we will not go into further detail here.
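Numerically, the semigroup property of the Brownian kernels is exactly the convolution identity N(0, s) ∗ N(0, t) = N(0, s + t). A grid-based sketch for d = 1 (the grid and the values of s, t are arbitrary choices):

import numpy as np

def gauss(z, var):
    return np.exp(-z**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

s, t = 0.5, 1.3
z = np.linspace(-12, 12, 4001)
dz = z[1] - z[0]

# (mu_s mu_t)(0, dz) has density gauss(., s) convolved with gauss(., t).
conv = np.convolve(gauss(z, s), gauss(z, t), mode='same') * dz

# Chapman-Kolmogorov: this should match the N(0, s + t) density.
print(np.max(np.abs(conv - gauss(z, s + t))))   # close to 0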

While Markov semigroups are the most important ones for us, our fundamental existence result below can be formulated for more general families of transition kernels.

Definition 3.2.3. Let (E, E) be a measurable space and let a two-parameter family of transition kernels µ_{s,t}, s, t ∈ T, where T ⊂ R, on (E, E) be given. The family is called consistent if µ_{s,t} µ_{t,u} = µ_{s,u} for all s, t, u ∈ T with s < t < u.

Example 3.2.4. For any Markov semigroup µt, t ∈ [0,∞), the family defined via

µs,t := µt−s, s, t ∈ [0,∞), s < t,

is a consistent family of transition kernels.

The following can be interpreted as a generalisation of Ionescu-Tulcea's theorem to some uncountable index sets T ⊂ R. However, the price we pay is that we also lose some generality when comparing to Ionescu-Tulcea, since in what comes below we only deal with consistent families of transition kernels as well as Markov kernels. The stochastic processes that naturally arise from Markov kernels (i.e., Markov processes, see Theorem 3.2.8 below) on the one hand provide a good approximation to many real-world processes, and on the other hand are quite amenable to mathematical investigation and numerical simulation.

Theorem 3.2.5. Let µ_{s,t}, s, t ∈ T ⊂ [0, ∞) with s < t, be a consistent family of transition kernels on a Polish space (S, O), where 0 ∈ T.

Then there exists a (unique) transition kernel µ from (S, B(S)) to (S^T, B(S)^{⊗T}) such that for all x ∈ S and all finite subsets J = {t0, t1, …, tn} ⊂ T with 0 = t0 < t1 < … < tn, we have

µ(x, ·) ∘ π⁻¹_{(t0,…,tn)}(B) = (δx ⊗ ⊗_{i=1}^n µ_{t_{i−1},t_i})(B)   ∀B ∈ B(S)^{⊗J}.    (3.2.3)


Proof. Similarly to the proof we gave for Ionescu-Tulcea's theorem before, the proof can be performed using Kolmogorov's existence and uniqueness theorem. We leave the details to the Reader and refer to [Kle14, Thm. 14.42] for a proof.

We can generalise the previous result to the case of Markov semigroups with an arbitrary given initial distribution.

Corollary 3.2.6. Let (µt), t ∈ T ⊂ [0, ∞), be a Markov semigroup of transition kernels on a Polish space (S, O), where 0 ∈ T.

If ν is a probability measure on (S, B(S)), then there exists a unique probability measure Pν on (S^T, B(S)^{⊗T}) such that for all 0 = t0 < t1 < … < tn with ti ∈ T,

Pν ∘ (π_{t0}, …, π_{tn})⁻¹ = ν ⊗ ⊗_{i=0}^{n−1} µ_{t_{i+1}−t_i}.

In the case ν = δx for some x ∈ S, we also write Px for Pν.

Proof. This follows from Theorem 3.2.5.

The previous result tells us that given a suitable family of transition kernels, we can find a probability measure on the T-fold product space which projects down onto the finite dimensional spaces and fulfills the corresponding consistency condition. As we have seen before in the context of Kolmogorov's existence and uniqueness theorem, this gives rise to a stochastic process (Xt), t ∈ T, with the property that its law is given by P.

Since we will not delve into the depths of Markov processes here but only give some basic results, we will restrict ourselves to giving the definition of so-called time homogeneous processes – i.e., those where the transition probabilities from x at time s to B at time t depend on s and t only through the difference t − s.

However, when concepts in mathematics become more advanced and more technical, there is oftentimes an entire zoo of just slightly different definitions around. This is the case for Markov processes as well.

Definition 3.2.7. Let a Polish space (S, O) be given, as well as a family (Px), x ∈ S, of probability measures on (Ω, F), and measurable mappings (Xt), t ∈ T, from (Ω, F) to (S, B(S)); i.e., for each x ∈ S, the family (Xt), t ∈ T, is a stochastic process on (Ω, F, Px). If

(a)

Px(X0 = x) = 1 ∀x ∈ S;

(b) for every t ∈ T, the mapping

S × B(S) ∋ (x, F) ↦ Px(Xt ∈ F)

defines a transition kernel on (S,B(S));

(c) the Markov property is fulfilled in the sense that for all x ∈ S, s, t ∈ T with s < t, and for all B ∈ B(S), we have

Px(X_{t+s} ∈ B | Fs) = P_{Xs}(Xt ∈ B)   Px-a.s.,    (3.2.4)

with (Ft) the canonical filtration.


then we call the process (Xt), t ∈ T, a Markov process.³

We will also get back to the general Definition 3.2.7 in Chapter 5 when introducing Brownian motion. For the time being, however, before giving examples in Section 3.2.1, we will show how Markov processes generally arise from the semigroups introduced above.

Theorem 3.2.8. Let (µt), t ∈ T ⊂ [0, ∞), be a Markov semigroup of transition kernels on a Polish space (S, O), where 0 ∈ T.

Then there exists a family of probability measures (Px), x ∈ S, on a measurable space (Ω, F), as well as a stochastic process (Xt), t ∈ T, defined on (Ω, F, Px), such that the process (Xt) is a Markov process on (Ω, F, Px), x ∈ S, with respect to the canonical filtration (Ft), and with transition probabilities given by

Px(Xt ∈ B) = µt(x, B),   ∀t ∈ T, x ∈ S, B ∈ B(S).    (3.2.5)

Vice versa, if there is a family of probability measures (Px), x ∈ S, and a measurable space (Ω, F) such that (Xt), t ∈ T, is a Markov process on (Ω, F, Px), x ∈ S, then defining µt via the equation in (3.2.5) we obtain a Markov semigroup.

Proof. Let a Markov semigroup as in the assumptions be given. Corollary 3.2.6 implies that for each x ∈ S there exists a unique probability measure Px on (S^T, B(S)^{⊗T}) such that for 0 = t0 < t1 < … < tn with ti ∈ T,

Px ∘ π⁻¹_{t0,…,tn} = δx ⊗ ⊗_{i=0}^{n−1} µ_{t_{i+1}−t_i}.    (3.2.6)

Define on (S^T, B(S)^{⊗T}, Px) the coordinate process

Xt : S^T ∋ ω ↦ ωt ∈ S,   t ∈ T.

Then the general setup of Definition 3.2.7 is fulfilled, and also Part (a) holds true. Furthermore, Part (b) is fulfilled since (3.2.6) implies

Px(Xt ∈ F) = µt(x, F),   ∀x ∈ S, F ∈ B(S),    (3.2.7)

and µt is a transition kernel by assumption. It remains to show

Px(X_{tn} ∈ Fn | F_{t_{n−1}}) = P_{X_{t_{n−1}}}(X_{tn−t_{n−1}} ∈ Fn)   Px-a.s.,    (3.2.8)

and we do so by showing that the right-hand side has the required measurability and integrability properties of (a version of) Px(X_{tn} ∈ Fn | F_{t_{n−1}}).

To start with, note that the F_{t_{n−1}} − B-measurability of the random variable on the right-hand side of (3.2.8) follows from (3.2.7) and the measurability of X_{t_{n−1}}. Furthermore, from (3.2.6) we infer that for 0 = s0 < s1 < … < sm := t_{n−1}, m ∈ N, as well as Fi ∈ B(S), i ∈ {0, …, m}, we have

Px(X_{tn} ∈ Fn, X_{si} ∈ Fi ∀i ∈ {0, 1, …, m})
    = ∫_{Fm} Px(X_{si} ∈ Fi ∀i ∈ {0, 1, …, m−1}, X_{t_{n−1}} ∈ dx_{n−1}) · µ_{tn−t_{n−1}}(x_{n−1}, Fn)
    = Ex[ ∏_{i=0}^m 1_{Fi}(X_{si}) · µ_{tn−t_{n−1}}(X_{t_{n−1}}, Fn) ]
    = Ex[ ∏_{i=0}^m 1_{Fi}(X_{si}) · P_{X_{t_{n−1}}}(X_{tn−t_{n−1}} ∈ Fn) ].    (3.2.9)

³ Similar concepts also go by 'universal Markov process' ([Bau92]) or 'Markov family' ([KS91], [Str11]) in the literature.


Since the π-system {X_{si} ∈ Fi ∀i : 0 = s0 < s1 < … < sm = t_{n−1}, Fi ∈ B(S), m ∈ N} generates F_{t_{n−1}}, in combination with Corollary 1.2.18 of [Dre19] this implies (3.2.8), and hence (Xt) as defined above is indeed a Markov process.

To establish the converse direction, let a Markov process be given. So we have a family of probability measures (Px), x ∈ S, with the required properties, and we define a transition kernel via

µt(x, F ) := Px(Xt ∈ F ), t ∈ T, x ∈ S, F ∈ B(S). (3.2.10)

Then µt defines a transition kernel on (S, B(S)) due to Definition 3.2.7 (b). Using the Markov property we find that

(µs µt)(x, F) = ∫ µs(x, dx1) µt(x1, F) = ∫ Px(Xs ∈ dx1) P_{x1}(Xt ∈ F)
    = Ex[P_{Xs}(Xt ∈ F)] = Px(X_{s+t} ∈ F) = µ_{s+t}(x, F),

which all in all establishes that (µt), t ∈ T , forms a Markov semigroup.

While for the case T = N0 this is essentially all we want, in the case T = [0, ∞) very interesting questions arise concerning the so-called path properties of a process (Xt), t ∈ [0, ∞). Indeed, for (Ω, F, P) := ([0, 1], B([0, 1]), λ|_{[0,1]}) we can define a stochastic process (Xt), t ∈ [0, 1], via

Xt(ω) := 1_{{ω}}(t),   t ∈ [0, 1], ω ∈ Ω,

as well as (Yt), t ∈ [0, 1], via

Yt(ω) := 0,   t ∈ [0, 1], ω ∈ Ω.

Then (Xt) is a process whose sample paths

[0, 1] ∋ t ↦ Xt(ω),   ω ∈ Ω,

are not continuous, whereas the paths

[0, 1] ∋ t ↦ Yt(ω),   ω ∈ Ω,

are continuous for all ω ∈ Ω.

However, all finite dimensional distributions of (Xt) and (Yt) coincide, and hence so does the projective limit, i.e., the consistent probability measure on (R^{[0,1]}, B^{⊗[0,1]}). In particular, this implies that the path properties are not uniquely characterised by the projective limit. Therefore, in addition to comparing processes via their finite-dimensional distributions, it turns out to be useful to consider the following stronger properties.

Definition 3.2.9. Let (Xt) and (Yt) be two stochastic processes. We say that (Xt) is a modification of (Yt) if for each t ∈ T we have

P(Xt = Yt) = 1.

Definition 3.2.10. Let (Xt) and (Yt) be two stochastic processes. We say that (Xt) and (Yt) are indistinguishable if there exists a set A with P(A) = 1 such that

A ⊆ {Xt = Yt ∀t ∈ T}.

It is easy to check that if (Xt) and (Yt) are indistinguishable, then they are also modifications of each other, and if they are modifications of each other, then they have the same finite-dimensional distributions.


Exercise 3.2.11. Show that the converses of the previous implications are not generally true.

In particular, Example 3.2.2 (a) shows that Theorem 3.2.5 applied to the Brownian semigroup would almost already give us Brownian motion – however, using this construction we cannot yet guarantee continuous sample paths (let alone that the paths are measurable functions from [0, ∞) to R).

There is, however, a powerful abstract result which gives a sufficient criterion for the existence of a modification of the original process such that this modification has continuous sample paths. Before stating the result we recall the notion of Hölder continuity.

Definition 3.2.12. Let (X1, d1) and (X2, d2) be metric spaces. For α ∈ (0, 1], a function f : X1 → X2 is called α-Hölder continuous at x ∈ X1 if for each ε > 0,

sup_{y∈X1, 0<d1(x,y)≤ε} d2(f(x), f(y)) / d1(x, y)^α   is finite.    (3.2.11)

f is called locally α-Hölder continuous if for each x ∈ X1 there exists ε > 0 such that

sup_{y,z∈X1, y≠z, d1(x,y)≤ε, d1(x,z)≤ε} d2(f(y), f(z)) / d1(y, z)^α   is finite.

It is called α-Hölder continuous if the supremum in (3.2.11) is bounded from above as a function of x ∈ X1.

Theorem 3.2.13 (Kolmogorov-Chentsov). Let (Xt), t ∈ [0, ∞), be a stochastic process taking values in a complete metric space (X, d). If for every T > 0 there exist constants α, β, C ∈ (0, ∞) such that

E[d(Xs, Xt)^α] ≤ C|s − t|^{1+β},   ∀s, t ∈ [0, T],

then there exists a modification (X̃t), t ∈ [0, ∞), of (Xt), t ∈ [0, ∞), such that for each γ ∈ (0, β/α), the sample paths [0, ∞) ∋ t ↦ X̃t(ω) are locally γ-Hölder continuous for P-almost all ω ∈ Ω.

Proof. A complete proof of this result would divert us too far from the main topics of this class. We therefore refer to [Kal02, Thm. 3.23] for further details.

Exercise 3.2.14. (a) Consider Example 3.2.2 (a). Start with showing the following result.

Lemma 3.2.15. Let Y ∼ N(0, t·Id_d) be a d-dimensional normally distributed random variable with covariance matrix t·Id_d. Then for n ∈ N there exists a constant Cn ∈ (0, ∞) such that for all t ∈ (0, ∞),

E[‖Y‖₂^{2n}] = Cn t^n.

(b) Show that the Markov process arising from the Poisson semigroup from Example 3.2.2 (b) does not satisfy the assumptions of Theorem 3.2.13 for any admissible choice of parameters (in distribution this is the 'Poisson process', but we will require some sample path regularity and hence construct it explicitly below).

Taking advantage of this powerful result, with yet another small definition we are now even able to prove the existence of Brownian motion.

Definition 3.2.16. An (R^d, B^d)-valued stochastic process (Xt), t ∈ [0, ∞), has independent increments if

for all s, t ∈ [0, ∞) with s < t, we have that Xt − Xs is independent of Fs,    (3.2.12)

with (Ft), t ∈ [0, ∞), denoting the canonical filtration of (Xt), t ∈ [0, ∞).

Furthermore, we say that the process (Xt) has stationary increments if, for any 0 < s < t, all the random variables X_{t+r} − X_{s+r}, r ≥ 0, have the same distribution.


Remark 3.2.17. Condition (3.2.12) is equivalent to the following: for all t1, …, tn ∈ [0, ∞) with t1 < … < tn, the random variables X_{t_{i+1}} − X_{t_i}, i ∈ {1, …, n − 1}, form an independent family.

Definition 3.2.18. Let x ∈ R^d, and let (Bt), t ∈ [0, ∞), be an (R^d, B^d)-valued stochastic process defined on (Ω, F, P) such that:

(a)

B0 = x;

(b) the increments of the process (Bt) are independent;

(c) for any 0 ≤ s < t, the random variable Bt −Bs is N (0, (t− s)Idd)-distributed;

(d) P-almost surely, the mapping

[0, ∞) ∋ t ↦ Bt

is continuous.

Then (Bt) is called Brownian motion starting in x. If x = 0, this process is also referred to as standard Brownian motion.

Remark 3.2.19. As in Corollary 3.2.6, if we can construct Brownian motion starting in x ∈ R^d, then, given a probability distribution ν on (R^d, B^d), we can also construct Brownian motion with initial distribution ν.

Theorem 3.2.20. For any x ∈ R^d, Brownian motion starting in x exists, and P-a.s. the sample paths are even locally α-Hölder continuous for each α ∈ (0, 1/2).

Proof. Recall the Brownian semigroup defined in Example 3.2.2 (a). Due to Theorem 3.2.8 and (3.2.6), we can infer the existence of a Markov process with independent and appropriately normally distributed increments.

The only property which has not been addressed so far is that of locally α-Hölder continuous sample paths. Due to Exercise 3.2.14 (a), however, for α ∈ (0, (n−1)/(2n)) we can invoke Theorem 3.2.13 to provide us with the existence of a modification of the original process such that this modification has the desired local α-Hölder continuity of the sample paths.

We now observe that if two processes are modifications of each other, then P-a.s. they coincide at all times t ∈ Q ∩ [0, ∞). But since, P-a.s., the sample paths of continuous processes (indexed by [0, ∞)) are already characterised by their evaluations at rational times, we infer that the α-Hölder continuous version from above must already be P-a.s. locally α-Hölder continuous for all α ∈ (0, 1/2), since (n−1)/(2n) ↑ 1/2.

Before we move on from Brownian motion to another process, it is natural to ask at this stage whether more than one Brownian motion starting at x exists, that is, whether the choice of the probability space leads to "different" processes. We answer this question in the following proposition.

Proposition 3.2.21. Let (Xt), Px, x ∈ S, be a Markov process and let (Yt) be a stochastic process on (Ω̃, F̃, P̃) such that

P̃(Yt ∈ B | F̃s) = P_{Ys}(X_{t−s} ∈ B),   s < t ∈ T, B ∈ B(S).    (3.2.13)

Then the distribution of (Yt) under P̃ is the same as the distribution of (Xt) under Pν, where Pν := ∫ Px ν(dx) and Y0 ∼ ν.


Proof. Note first that (3.2.13) and algebraic induction imply that

Ẽ[f(Yt) | F̃s] = E_{Ys}[f(X_{t−s})]

for any bounded measurable function f. To show that the distributions of (Xt) and (Yt) under their respective laws are the same, it suffices to show that for all n ∈ N, any times t1 < … < tn in T, and any choice of bounded measurable functions f1, …, fn it holds that

Ẽ[f1(Y_{t1}) ⋯ fn(Y_{tn})] = Eν[f1(X_{t1}) ⋯ fn(X_{tn})].

We prove this by induction, starting with n = 1. Here it holds that

Ẽ[f1(Y_{t1})] = Ẽ[Ẽ[f1(Y_{t1}) | F̃0]] = Ẽ[E_{Y0}[f1(X_{t1})]] = Eν[f1(X_{t1})],

where the second equality follows from our initial observation and the last equality follows from the definition of Pν.

where the second equality follows by our initial observation and the last equality follows by thedefinition of Pν .We now argue the inductive step n → n + 1. Writing gn(x) = fn(x)Ex[fn+1(Xtn+1−tn)] andnoting that this defines a measurable function, we have

Eν [f1(Xt1) · · · fn+1(Xtn+1)] = Eν [f1(Xt1) · · · fn(Xtn)EXtn [fn+1(Xtn+1−tn)]]

= Eν [f1(Xt1) · · · gn(Xtn)] = E[f1(Yt1) · · · gn(Ytn)]

= E[f1(Yt1) · · · fn(Ytn)EYtn [fn+1(Xtn+1−tn)]]

= E[f1(Yt1) · · · fn(Ytn)E[fn+1(Ytn+1) | Ftn ]]

= E[f1(Yt1) · · · fn+1(Ytn+1)],

which concludes the proof.

In the following section we construct in detail (and without referring to results we have not yet proven) another process that will be important to us. In addition to the distributional requirements, we will also impose additional properties on the sample paths.

3.2.1 Poisson process

In order to construct the Poisson process, we could proceed using the machinery introduced above (see Remark 3.2.26); we will, however, follow a more hands-on approach here which will provide us with the fact that P-a.s., the sample paths of the process are càdlàg, i.e., 'continue à droite, limite à gauche' (English: RCLL, 'right-continuous with left limits').

Definition 3.2.22. For κ > 0 let (Tn), n ∈ N, be a sequence of i.i.d. Exp(κ)-distributed random variables. For t ∈ [0, ∞) define

Nt := sup{n ∈ N0 : Σ_{i=1}^n Ti ≤ t}.    (3.2.14)

The stochastic process (Nt), t ∈ [0, ∞), is called a Poisson process with rate κ.
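Definition 3.2.22 is directly implementable; the following sketch (rate and horizon are arbitrary choices, not from the text) simulates sample paths by summing i.i.d. Exp(κ) interarrival times and checks by Monte Carlo that Nt has mean and variance κt, consistent with the Poi(κt) marginal established in Proposition 3.2.24 below.

import numpy as np

rng = np.random.default_rng(3)

def poisson_path(kappa, t_max):
    """Jump times of (N_t) on [0, t_max]: partial sums of i.i.d. Exp(kappa) times."""
    jumps = []
    t = rng.exponential(1 / kappa)
    while t <= t_max:
        jumps.append(t)
        t += rng.exponential(1 / kappa)
    return jumps            # N_t = number of entries <= t (a cadlag step function)

kappa, t = 2.0, 5.0
samples = [len(poisson_path(kappa, t)) for _ in range(20_000)]
print(np.mean(samples), np.var(samples))   # both should be close to kappa * t = 10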

Lemma 3.2.23. Let (Nt) be a Poisson process with rate κ ∈ (0, ∞). Then for P-a.a. ω ∈ Ω, the sample paths

[0, ∞) ∋ t ↦ Nt(ω)

are right-continuous with left limits.

Proof. From (3.2.14) it follows that for any t ∈ (0, ∞), we have that P-a.s., Nt ∈ N0. In particular, P-a.s., (Nt) has only finitely many jumps in the interval [0, t], and using (3.2.14) again, we infer that P-a.s., (Nt) has left limits and is right-continuous on [0, t]. Taking the union of the corresponding exceptional null sets over t ∈ N establishes the claim.


The reason that the process introduced above is called a Poisson process is given by the following proposition.

Proposition 3.2.24. Let (Nt), t ∈ [0, ∞), be a Poisson process with rate κ > 0. Then (Nt) has stationary and independent increments; furthermore, for any s, t ∈ [0, ∞) with s < t,

Nt − Ns ∼ Poi(κ(t − s)).

We furthermore take advantage of the following claim.

Claim 3.2.25. Let X1, …, Xn be independent real random variables whose distributions have densities ϕ1, …, ϕn with respect to the Lebesgue measure. Then the random vector (X1, …, Xn) has density (x1, …, xn) ↦ ϕ1(x1) ⋯ ϕn(xn) with respect to the n-dimensional Lebesgue measure.

Proof. For B1, …, Bn ∈ B(R) we have

P((X1, …, Xn) ∈ B1 × … × Bn) = ∏_{i=1}^n P(Xi ∈ Bi) = ∏_{i=1}^n ∫_{Bi} ϕi(xi) dxi
    = ∫_{B1×…×Bn} ∏_{i=1}^n ϕi(xi) d(x1, …, xn).

Hence, the two measures on B(R^n) defined via

P((X1, …, Xn) ∈ ·)   and   ∫_· ∏_{i=1}^n ϕi(xi) d(x1, …, xn)

coincide on the generating π-system of B(R^n) given by such products of Borel sets, and thus Theorem 1.2.16 of [Dre19] yields the result.

Proof sketch of Proposition 3.2.24. We show that for k, n ∈ N0 as well as 0 < s < t we have

P(Ns = k, Nt − Ns = n) = e^{−κs} (κs)^k/k! · e^{−κ(t−s)} (κ(t−s))^n/n!.    (3.2.15)

While this only proves that the two increments Ns and Nt − Ns are independent and have the desired distributions, the general case follows in a similar way but is notationally strictly more involved; hence we leave it to the reader.

Using that T1, T2, … form an i.i.d. sequence of Exp(κ) variables and taking advantage of Claim 3.2.25, integrating iteratively from 'the inside out' we obtain

P(Ns = k, Nt = n + k)
    = κ^{n+k+1} ∫_{[0,∞)} dx1 ⋯ ∫_{[0,∞)} dx_{n+k+1} e^{−κ Σ_{i=1}^{n+k+1} xi}
      · 1_{{Σ_{i=1}^k xi ≤ s,  x_{k+1} > s − Σ_{i=1}^k xi,  Σ_{i=1}^{n+k} xi ≤ t,  x_{n+k+1} > t − Σ_{i=1}^{n+k} xi}}.

The crucial observation now is that for x1, …, x_{n+k} fixed with Σ_{i=1}^{n+k} xi ≤ t,

κ ∫_{[0,∞)} dx_{n+k+1} e^{−κ Σ_{i=1}^{n+k+1} xi} 1_{{x_{n+k+1} > t − Σ_{i=1}^{n+k} xi}} = e^{−κt},


so after this integration we're left with only indicators in the previous iterated integral. Furthermore, for x1, …, xk such that Σ_{i=1}^k xi ≤ s, by induction on n we obtain

∫_{[0,∞)} dx_{k+1} ⋯ ∫_{[0,∞)} dx_{n+k} 1_{{x_{k+1} > s − Σ_{i=1}^k xi,  Σ_{i=k+1}^{n+k} xi ≤ t − Σ_{i=1}^k xi}} = (t − s)^n/n!.

Hence, the first display of the proof can be continued to obtain

P(Ns = k, Nt = n + k) = (κs)^k/k! · (κ(t − s))^n/n! · e^{−κt},

which establishes (3.2.15) and hence proves what we wanted.

It is often assumed that the interarrival times between customers at e.g. a restaurant or a store, or the times between calls to a customer service department, are i.i.d. exponentially distributed.⁴ As pointed out in the discussion subsequent to Theorem 1.9.9 of [Dre19] already, it seems reasonable to assume the number of customers arriving in a certain time interval to be Poisson distributed. Therefore, the above gives yet another manifestation of this (hinging on the same mathematical assumptions, though).

Remark 3.2.26. Alternatively, we could also introduce the Poisson process (modulo the sample path properties) as the Markov process arising from the Poisson semigroup via Example 3.2.2 (b) and Theorem 3.2.8, where (3.2.6) ensures the independent and appropriately Poisson distributed increments. In particular, the conclusions of Proposition 3.2.24 would then essentially have followed from the construction.

This, however, would not yet have given us the connection to the sums of independent exponentially distributed random variables derived above. More importantly, as we have seen before, the construction of a process by specifying a Markov semigroup / the finite dimensional distributions does not yet give us information on the regularity of the sample paths t ↦ Xt(ω). Instead, for the path regularity we could try to invoke the Kolmogorov-Chentsov theorem; you can check, cf. Exercise 3.2.14 (b), that the assumptions of this result are not fulfilled in the context of the Poisson process, so we could not deduce any path regularity from it. This makes sense in the light of the hands-on construction that we followed above, where we obtained that P-a.s., the sample paths are càdlàg, but not continuous.

It can be shown that the above can be generalised to Poisson processes N(t) with non-constant rates, i.e., given a rate function κ : [0, ∞) → [0, ∞), the increment N(t) − N(s) is Poisson distributed with parameter ∫_s^t κ(r) dr. Like the Poisson process, this generalised Poisson process turns out to be very useful in investigating mathematical problems that find their motivation in applications such as statistical physics or finance; see Example 4.1.12 below.

One reason the Poisson process is so important in probability theory is that the time between two jumps is exponentially distributed and hence has no memory. Indeed, we (might) have seen already that the only distributions without memory are the exponential distributions (and, accommodating the discrete nature of the geometric distribution, it can also be considered memoryless). This means that if X ∼ Exp(κ), then

P(X ≥ s+ t |X ≥ t) = P(X ≥ s) ∀s, t ∈ [0,∞).
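For completeness, the one-line verification of this identity, using P(X ≥ u) = e^{−κu} for X ∼ Exp(κ):

\[
  \mathbb{P}(X \ge s+t \mid X \ge t)
  = \frac{\mathbb{P}(X \ge s+t)}{\mathbb{P}(X \ge t)}
  = \frac{e^{-\kappa(s+t)}}{e^{-\kappa t}}
  = e^{-\kappa s}
  = \mathbb{P}(X \ge s).
\]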

If we now want to construct a continuous time Markov process which is essentially made up of finitely many jumps on compact intervals (think of simple random walk, but now with time indexed by [0, ∞) instead of N0; Brownian motion, however, does not fall into this regime), then the waiting times between two consecutive jumps have to be exponentially distributed in order

⁴ Of course, this is only an approximation, since there are usually more people eating out for lunch or dinner than at 4 pm in the afternoon. During peak times, however, this is considered a reasonable approximation.


to preserve the Markov property. Indeed, if the time between two jumps were not memoryless, we could gain some knowledge about the time of a future jump by knowing for how long the process has been sitting at the current site. This, however, translates to Ft having strictly more information about the future than just Xt, a contradiction to (3.0.1).

Example 3.2.27 (Continuous time random walk). Let (Nt) be a Poisson process with parameter κ > 0, and let (Xn), n ∈ N, be an i.i.d. sequence of real random variables. Then

St := Σ_{i=1}^{Nt} Xi

defines a continuous time random walk (with increment distribution P ∘ X1⁻¹).
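Building on the Poisson sketch above, the following simulates a path of (St) with hypothetical standard normal increments (the choice N(0, 1) is for illustration only); the walk is constant between the jump times of the Poisson clock.

import numpy as np

rng = np.random.default_rng(4)

def ctrw(kappa, t_max, n_grid=1000):
    """Evaluate S_t = sum_{i <= N_t} X_i on a time grid, with X_i ~ N(0,1) (hypothetical)."""
    # Jump times of the Poisson clock (drawing more than enough interarrival times).
    jumps = np.cumsum(rng.exponential(1 / kappa, size=int(10 * kappa * t_max) + 50))
    jumps = jumps[jumps <= t_max]
    increments = rng.standard_normal(len(jumps))
    grid = np.linspace(0, t_max, n_grid)
    # N_t for each grid time, then the corresponding partial sum of increments.
    counts = np.searchsorted(jumps, grid, side='right')
    partial = np.concatenate(([0.0], np.cumsum(increments)))
    return grid, partial[counts]

grid, S = ctrw(kappa=3.0, t_max=10.0)
print(S[-1])   # value of the walk at time t_max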

3.3 Martingales and the Markov property

There is a fundamental connection between Markov processes and martingales which goes back (at least) to [SV79]. While the resulting theory is particularly useful in the case of continuous time when investigating stochastic differential equations and diffusion processes, we will touch upon it in the setting of discrete time and countable state space in order to keep technicalities light and understanding deep. In particular, we will denote by P the corresponding transition matrix.

To introduce the principal ideas, assume the following setting: Assume given a stochastic process (Xn), n ∈ N0, with each Xn taking values in a countable state space S, and suppose that we want to characterise the distribution of the entire process as a probability measure on (S^{N0}, (2^S)^{⊗N0}). In order to stick to the notation of Theorem 3.1.19, for n ∈ N, set Ωn := ×_{i=0}^n S and Ω := ×_{i=0}^∞ S, and define the σ-algebras Fn := ⊗_{i=0}^n 2^S, as well as F := ⊗_{i=0}^∞ 2^S on these spaces. Then the sequence of regular conditional probability distributions of Xn given F_{n−1}, in combination with the distribution of X0, already completely determines the distribution of the process (Xn). Indeed, since the cylinder sets form a π-system generating the product σ-algebra, the distribution of the process (Xn) is completely determined by its projections down to (Ωn, Fn), and for F0 × … × Fn ∈ Fn, using the tower property for conditional expectations in combination with the disintegration formula, we have

P((X0, …, Xn) ∈ F0 × … × Fn) = E[… E[E[1_{Xn∈Fn} | F_{n−1}] 1_{X_{n−1}∈F_{n−1}} | F_{n−2}] …],

where by the disintegration formula (Theorem 3.1.8), E[1_{Xn∈Fn} | F_{n−1}] = µn(·, {Xn ∈ Fn}), with µn the regular conditional probability of Xn given F_{n−1}.⁵ Specifying these (µn) is again equivalent to specifying the conditional expectations

E[f(Xn) | Fn−1]

for a sufficiently rich class of bounded functions f : S → R. More specifically, for such an f fixed we can introduce the notation

hn−1(X0, . . . , Xn−1) := E[f(Xn) | Fn−1]− f(Xn−1), (3.3.1)

which in particular implies

E[f(Xn) − f(X_{n−1}) − h_{n−1}(X0, …, X_{n−1}) | F_{n−1}] = 0,

⁵ Essentially, this follows from Ionescu-Tulcea's theorem 3.1.19, since the µn we have here can be 'extended' to transition kernels µ̃n from (Ω_{n−1}, F_{n−1}) to (Ωn, Fn) by defining

µ̃n(·, {(X1, …, Xn) ∈ F1 × … × Fn}) := 1_{{(X1,…,X_{n−1})∈F1×…×F_{n−1}}} µn(·, {Xn ∈ Fn}).


and so summing the previous display over time yields that the process

f(Xn) − f(X0) − Σ_{i=0}^{n−1} hi(X0, …, Xi),   n ∈ N0,

defines a martingale.

Now if (Xn) is not only a stochastic process but even satisfies the Markov property (3.0.1), then we immediately infer that h_{n−1} can be written as a function of X_{n−1} only, instead of X0, …, X_{n−1}, and furthermore h_{n−1} does not depend on n − 1. Hence, in a slight abuse of notation we can rewrite h through (3.3.1) as

h(x) := Σ_{y∈S} P(x, y) f(y) − f(x) = ((P − Id)f)(x),

i.e.,

h = (P − Id)f    (3.3.2)

as functions on S. Thus we infer that if (Xn) is a Markov chain and f : S → R is bounded, then

f(Xn) − f(X0) − Σ_{i=0}^{n−1} h(Xi),   n ∈ N0,    (3.3.3)

is again a martingale.

The power of this approach stems from the fact that the converse of the above is true as well; i.e., the law of a Markov chain with transition matrix P and some initial state x0 is the unique probability measure on (S^{N0}, (2^S)^{⊗N0}) such that (3.3.3) describes a martingale for all f and corresponding h.

We introduce the following definition in order to formalise (3.3.2).

Definition 3.3.1. For a Markov chain on a countable state space S and with transition matrix P, we call the matrix

L := P − Id    (3.3.4)

the generator of the Markov chain.
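For a concrete finite chain the generator is literally the matrix P − Id, and the compensator increment in (3.3.5) below is the one-step drift E[f(X_{n+1}) | Xn = x] − f(x). A sketch with a hypothetical 3-state transition matrix and an arbitrary function f:

import numpy as np

P = np.array([[0.5, 0.5, 0.0],     # hypothetical transition matrix on S = {0, 1, 2}
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
L = P - np.eye(3)                   # generator, Definition 3.3.1

f = np.array([1.0, 4.0, 9.0])       # an arbitrary bounded function on S

# One-step conditional expectation: E[f(X_{n+1}) | X_n = x] = (P f)(x),
# so the drift f picks up per step is exactly (L f)(x).
assert np.allclose(P @ f - f, L @ f)
print(L @ f)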

Remark 3.3.2. The definition corresponding to (3.3.4) in continuous time would be the operator L defined via

(Lf)(x) := lim_{t↓0} (1/t)(Ex[f(Xt)] − f(x));

of course, the existence of Lf is non-trivial here and one would have to investigate for which f that expression actually makes sense. By investigating the discrete time case we avoid these technicalities.

Using this notation we can rewrite (3.3.3) as

Mn := f(Xn) − f(X0) − Σ_{i=0}^{n−1} (Lf)(Xi),   n ∈ N0.    (3.3.5)

Also observe that this describes the Doob decomposition of the process (f(Xn)), n ∈ N0, since (Mn) is a martingale and the sum on the right-hand side defines a previsible process. Now, coming back to the above, we can formulate the following result, which provides us with a link between Markov chains and those matrices L which make the process in (3.3.5) a martingale.


Theorem 3.3.3 (Martingale problem). Let (Xn), n ∈ N0, be a stochastic process taking values in the countable state space S and adapted to a filtration (Fn). Furthermore, let L be an operator on R^S.⁶

Then (Xn) is a Markov chain if and only if for each bounded function f, the process (Mn) defined in (3.3.5) is a martingale.

In this case, the relation between P and L is given by

P = Id + L.    (3.3.6)

Proof. We have seen above that if (Xn) is a Markov chain, then (3.3.5) indeed describes a martingale.

To prove the converse direction, in order to show that (Xn) is indeed a Markov process, we compute, using the notation of (3.3.5),

E[f(X_{n+m}) | Fn] = E[M_{n+m} | Fn] + f(X0) + Σ_{i=0}^{n+m−1} E[(Lf)(Xi) | Fn]
    = Mn + f(X0) + Σ_{i=0}^{n−1} (Lf)(Xi) + Σ_{i=n}^{n+m−1} E[(Lf)(Xi) | Fn]
    = f(Xn) + Σ_{i=n}^{n+m−1} E[(Lf)(Xi) | Fn].

Plugging in m = 1 we infer that

E[f(X_{n+1}) | Fn] = f(Xn) + E[(Lf)(Xn) | Fn] = f(Xn) + (Lf)(Xn),

from which, by choosing indicator functions f = 1_B, it follows that

P(X_{n+1} ∈ B | Fn) = 1_B(Xn) + (L 1_B)(Xn) = ((L + Id)1_B)(Xn).

But the right-hand side is σ(Xn)-measurable, so (3.0.1) follows by iteration. This shows that (Xn) defines a Markov chain with transition matrix as in (3.3.6), and hence finishes the proof.

Remark 3.3.4. Of course we know that generators of Markov chains with countable state space S are of the form

L = P − Id,

where P is any stochastic matrix in [0, 1]^{S×S}. As alluded to before, however, in the continuous setting it is oftentimes not so clear whether an operator generates a Markov chain, and it is in that setting that the theory from above unfolds its full power.

Remark 3.3.5. We can generalise the notion of (super-/sub-)harmonic functions introduced in Remark 2.0.9. In fact, in the continuous context the operator ½Δ, defined on suitable functions from R^d to R, is the generator of Brownian motion (with its discrete counterpart

Δ : R^{Z^d} → R^{Z^d},   f ↦ ( (1/2d) Σ_{y∈Z^d, ‖x−y‖₁=1} (f(y) − f(x)) )_{x∈Z^d}

being the generator of discrete time simple random walk). Therefore, in the same way that we were able to define harmonic functions for general Markov chains in [Dre18], we can generalise

⁶ I.e., L can be interpreted as a matrix in R^{S×S} taking a function f : S → R to (Lf)(x) := Σ_{y∈S} L(x, y) f(y).


the notion of sub-/superharmonic functions. Indeed, a function f : S → R is called harmonic, subharmonic, or superharmonic (on U ⊂ S) with respect to a generator L if

Lf = 0,   Lf ≥ 0,   or   Lf ≤ 0,

respectively, where all (in-)equalities are to be understood as functions on S (or as functions on U only).

Theorem 3.3.3 again supplies us with the fact that for a Markov chain (Xn) with generator L we have that, independently of the initial condition, (f(Xn)) is a martingale, submartingale, or supermartingale, respectively, if (and essentially only if) f is harmonic, subharmonic, or superharmonic with respect to L, respectively.

We now turn to some applications of the above.

Exercise 3.3.6. The details of the following are left to the reader.

(a) Show that for d ≥ 3, simple random walk on Z^d is transient, i.e., that

Px(τ0 < ∞) < 1    (3.3.7)

for all x ∈ Z^d. For this purpose, we first observe that it is sufficient to show that (3.3.7) holds true for all x with ‖x‖ large enough.

(b) Show that simple random walk in Z2 is recurrent, i.e.,

Px(τ0 <∞) = 1.

3.4 Doob’s h-transform

An important tool for investigating Markov processes is the so-called h-transform, which we will introduce below. The motivation mainly is to investigate Markov processes conditioned on first hitting some set in a certain subset – what can we say about the resulting process? For the sake of simplicity we stick to the setting of Markov processes with countable state space S and discrete time N_0.
Now choose some x ∈ S as well as some A, B ⊂ S with A ⊂ B. What can we say about the process (X_n) under the measure P^h_x := P_x(· | X_{H_B} ∈ A)? While this conditioned process will still be a process in the sense that the sequences (X_n) are well-defined under this probability measure, it is a priori not clear what the law of this conditioned process is, nor whether it will still enjoy the Markov property.
Also, in order to facilitate things notationally, we will consider the modification (X̃_n) of (X_n) for which, for each y ∈ B, the transition probabilities are changed to

P̃(y, z) := 1_{y=z},

i.e., B plays the role of a so-called cemetery, and all other transition probabilities are left unchanged; all corresponding quantities for this modified process will carry a ˜ on them. Introducing the function

h : S ∋ x ↦ P̃_x(τ_B = τ_A),

the probability measure P^h_x has a density with respect to the standard probability measure P_x, which can be written as

dP^h_x / dP_x = 1_{τ_B=τ_A} / h(x).


Note that for y ∈ S we have

h(y) = P̃_y(τ_A = τ_B) = Ẽ_y[ Ẽ_y[1_{τ_A=τ_B} | X̃_1] ] = Ẽ_y[h(X̃_1)] = (P̃h)(y),

so the function h is harmonic with respect to L̃ = P̃ − Id (even on all of S due to our modification; otherwise the last equality in (3.4.1) below would generally not be true anymore). Therefore, the density is well-behaved under conditioning in the sense that

E_x[ dP^h_x/dP_x | F_n ] = E_x[1_{τ_B=τ_A} | F_n] / h(x) = P_x(τ_B = τ_A | X_n) / h(x) = h(X_n) / h(x),   (3.4.1)

where we have taken advantage of the harmonicity of h with respect to L̃, as well as of (a consequence of) the Markov property of (X_n). This again implies that for any bounded f : S → R we obtain, using that (X_n) has the Markov property with respect to P_x, that

E^h_x[ f(X_{n+1}) | F_n ] = E_x[ f(X_{n+1}) h(X_{n+1}) / h(x) | F_n ] = E_x[ f(X_{n+1}) h(X_{n+1}) / h(x) | X_n ],

and so we infer that (X_n) has the Markov property under P^h_x as well. Plugging in indicator functions, we infer that it has transition probabilities given by

P^h_x(X_{n+1} = z | X_n = y) = (h(z)/h(y)) P̃(y, z),   (3.4.2)

for all y, z ∈ S such that P^h_x(X_n = y) > 0. In particular, for y ∈ S\B, the transition probability to z ∈ S is given by

(h(z)/h(y)) P(y, z).

Also note that the above construction did not depend on the explicit nature of the function h, but only on the fact that h is harmonic with respect to P̃. This leads us to the following definition.

Definition 3.4.1. Let P̃ be the transition matrix of a Markov chain and let h : S → [0, ∞) be a function which is harmonic with respect to P̃. Then Doob's h-transform is the Markov chain with transition probabilities P̃^h given by

P̃^h(y, z) = (h(z)/h(y)) P̃(y, z),   y, z ∈ S.

Remark 3.4.2. It is possible to define P^h also directly for a harmonic h and a transition matrix P, without resorting to P̃. In this case, we set

P^h(y, z) := (h(z)/h(y)) P(y, z) if h(y) > 0, and P^h(y, z) := 1_{y=z} otherwise,   y, z ∈ S.

Note that P^h defined in this way agrees with the one from Definition 3.4.1.

Example 3.4.3 (Gambler's ruin). Consider simple random walk on {0, 1, . . . , M}, some M ∈ N, starting in x ∈ {1, . . . , M − 1}. In Section 2.0.2 of [Dre18] (using harmonic functions) as well as in Example 2.1.15 (using martingales) we found that the probability of reaching M before 0 when starting in i was given by i/M. We now want to obtain a more profound understanding of the process itself (instead of just the hitting probabilities) by investigating the law of this simple random walk conditioned on {H_{{0,M}} = H_M}.


The above derivation tells us that this process again has the Markov property, so it has a transition matrix. The adequate function is given by

h(x) := P_x(τ_{{0,M}} = τ_M),   x ∈ {0, 1, . . . , M},

which is harmonic and positive on {1, . . . , M}. Using h(0) = 0 and h(M) = 1 it is not hard to see that in fact

h(i) = i/M,   i ∈ {0, . . . , M}.

Hence, from (3.4.2) we infer that the transition probabilities of this h-transform are given by

P^h(x, x ± 1) = (1/2) · h(x ± 1)/h(x) = 1/2 ± 1/(2x),   x ∈ {1, . . . , M − 1}.
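As a plausibility check on these probabilities, the following Monte Carlo sketch (illustrative code of ours, not part of the notes) generates plain simple-random-walk paths, keeps only those realising the conditioning event {τ_{{0,M}} = τ_M}, and compares the empirical frequency of an upward first step from x with P^h(x, x + 1) = 1/2 + 1/(2x).

    import numpy as np

    rng = np.random.default_rng(1)
    M = 10

    def up_frequency_given_success(x, n_paths=50_000):
        # rejection sampling of SRW conditioned to reach M before 0
        ups = kept = 0
        for _ in range(n_paths):
            first = int(rng.integers(0, 2)) * 2 - 1   # +-1 with prob. 1/2 each
            pos = x + first
            while 0 < pos < M:
                pos += int(rng.integers(0, 2)) * 2 - 1
            if pos == M:                              # conditioning event
                kept += 1
                ups += (first == 1)
        return ups / kept

    for x in (1, 3, 5):
        print(x, up_frequency_given_success(x), 0.5 + 1 / (2 * x))

For x = 1 the conditioned walk must step up (a first step down would end in 0), matching P^h(1, 2) = 1.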

Lemma 3.4.4. Let (X_n), n ∈ N_0, be a Markov process with countable state space S and transition matrix P. Let h : S → [0, ∞) be a function which is harmonic on {x ∈ S : h(x) > 0} with respect to P − Id, and denote by

F := {x ∈ S : ∃ y ∈ S such that h(y) = 0 and P(x, y) > 0}

the set of points in S from which the process can jump to an element of the zero set of h. Then, for x ∈ S with h(x) > 0, under P^h_x the process

1 / h(X_{n∧τ_F})

is a martingale with respect to the canonical filtration (F_n) of the process (X_{n∧τ_F}), n ∈ N_0.

Proof. We have for n > 0 that

E^h_x[ 1/h(X_{(n+1)∧τ_F}) | F_n ] = Σ_{y∈S: h(y)>0} P^h(X_{n∧τ_F}, y) / h(y)
= Σ_{y∈S: h(y)>0} P(X_{n∧τ_F}, y) h(y) / ( h(y) h(X_{n∧τ_F}) )
= Σ_{y∈S: h(y)>0} P(X_{n∧τ_F}, y) / h(X_{n∧τ_F})
= P( h(X_{(n+1)∧τ_F}) > 0 ) / h(X_{n∧τ_F}) = 1 / h(X_{n∧τ_F}),

which completes the proof.

Example 3.4.5. As in Example 3.4.3, let (X_n) denote one-dimensional simple random walk, and denote by P^h_x, x ∈ Z\{0}, the Doob transform under which P^h_x(X_0 = x) = 1, and under which (X_n) is simple random walk conditioned on not hitting 0 (corresponding to taking M → ∞, which is OK since we have seen that the transition probabilities actually do not depend on M, so we can use an existence theorem of our choice (Ionescu-Tulcea, Kolmogorov) for this process). Then, for x, y, z ∈ N with 1 < x < y < z the following hold true:

(a) P^h_y(τ_x < τ_z) = x(z − y) / (y(z − x));

(b) P^h_y(τ_x < ∞) = x/y;

(c) P^h_x(∄ n ∈ N : X_n = x) = 1/(2x).

Proof. Left to the reader.
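Before attempting the proof, one can check claim (a) numerically by simulating the conditioned walk directly, with the transition probabilities 1/2 ± 1/(2x) from Example 3.4.3 (a hypothetical sketch of ours; the parameters are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)

    def prob_hit_x_before_z(y, x, z, n_paths=100_000):
        # walk under P^h: step up with probability 1/2 + 1/(2*pos)
        hits = 0
        for _ in range(n_paths):
            pos = y
            while x < pos < z:
                pos += 1 if rng.random() < 0.5 + 1 / (2 * pos) else -1
            hits += (pos == x)
        return hits / n_paths

    x, y, z = 2, 5, 20
    print(prob_hit_x_before_z(y, x, z))    # empirical value
    print(x * (z - y) / (y * (z - x)))     # claim (a): = 1/3 here

Letting z grow large, the value x(z − y)/(y(z − x)) tends to x/y, illustrating claim (b).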


Chapter 4

Stationarity & ergodicity

We have seen that the law of large numbers is one of the most important limit theorems we have gotten to know so far (along with the central limit theorem). The goal of this section is to extend this result significantly – in particular, we will be able to cover situations where implications similar to that of the strong law of large numbers are obtained in regimes where we do have certain dependence between the respective random variables. See also Remark 4.1.2 below for an embedding of ergodicity in its historical context.

Definition 4.0.1. Let (X_t), t ∈ T, with T = N, N_0, Z, or T = R,^1 be a stochastic process taking values in a Polish space (S, O) with Borel σ-algebra B(S). If for all s ∈ T we have that for all t_1 < . . . < t_n,

P((X_{t_1}, . . . , X_{t_n}) ∈ ·) = P((X_{t_1+s}, . . . , X_{t_n+s}) ∈ ·)

as measures on (S^n, B(S^n)), then the stochastic process (X_t), t ∈ T, is called time homogeneous or stationary.

Remark 4.0.2. In the setting of Definition 4.0.1, a stochastic process is stationary if and only if for all s ∈ T, the processes (X_t), t ∈ T, and (X_{t+s}), t ∈ T, have the same law on (S^T, B(S)^{⊗T}).

Exercise 4.0.3. In the setting of Definition 4.0.1, find an example which shows that in order for the process (X_t) to be stationary it is not sufficient to have that the one-dimensional distributions of time shifts coincide, i.e., that P ∘ X_t^{−1} = P ∘ X_{t+s}^{−1} for all s ∈ T.

Example 4.0.4. (a) You might have seen in Proposition 1.17.11 of [Dre18] that every Markov chain on a finite state space S has a stationary distribution, and in Proposition 1.17.15 of [Dre18] that if the chain is irreducible (see [Dre18]), then the stationary distribution is unique. Let π be such a distribution and denote the transition matrix of the chain by P. If we start the chain in the initial distribution π, then for the resulting stochastic process (X_n) we have that (4.0.1) holds true.

(b) Recall the Ornstein-Uhlenbeck process as defined in the exercise classes. As we did before in the construction of Markov processes, instead of starting the process in a deterministic point, we could also start it from the initial distribution N(0, σ²/(2θ)). One can show that with this initial distribution, the resulting process is a stationary process and fulfills (4.0.1).

(c) Let (X_n), n ∈ N_0, be an i.i.d. sequence of real random variables and, for m ∈ N fixed as well as c_0, . . . , c_m ∈ R fixed, define the sequence Y_n := Σ_{i=0}^{m} c_i X_{n+i}. Show that (Y_n) is a stationary process (a 'moving average process').

^1 More generally, any monoid T ⊂ R with respect to addition would work here, and this is what we assume for the rest of this chapter.


(d) Show that Brownian motion is not stationary, no matter which initial distribution you choose.

The concept of stationary processes can be carried over to more general settings as follows.

Definition 4.0.5. Let (Ω, F, µ) be a σ-finite measure space. An F−F-measurable mapping θ : Ω → Ω is called a measure-preserving transformation ('maßerhaltende Transformation') if µ = µ ∘ θ^{−1}. In this setting, µ is also called an invariant measure for θ.

Example 4.0.6. (a) Given the distribution P on (S^{N_0}, B(S)^{⊗N_0}) of a stationary process, the coordinate shift θ : (x_n)_{n∈N_0} ↦ (x_{n+1})_{n∈N_0} is a measure-preserving transformation on (S^{N_0}, B(S)^{⊗N_0}).

Certainly θ is a map from S^{N_0} onto itself, and it is measurable since for an arbitrary cylinder set A ∈ B(S)^{⊗N_0} we have θ^{−1}(A) = S × A, which is a cylinder set again and hence lies in B(S)^{⊗N_0}. But the cylinder sets generate B(S)^{⊗N_0}, and hence the desired measurability of θ follows from Theorem 1.4.8 of [Dre19].

Similarly, for a cylinder set A ∈ B(S)^{⊗N_0} we have P(A) = P ∘ θ^{−1}(A) (since the process is stationary); hence the two measures coincide on a ∩-stable generator of B(S)^{⊗N_0} and therefore coincide due to Dynkin's π-λ theorem. Thus, θ is measure-preserving.

(b) Consider the space (R^d, B(R^d), λ^d). Show that for any y ∈ R^d, the shift

θ_y : R^d → R^d,   x ↦ x + y,

is a measure-preserving transformation.

(c) Let (x_n), n ∈ N_0, be an S-valued sequence with period p ∈ N, i.e., x_n = x_{n+p} for all n ∈ N_0. Define the discrete probability measure

P := (1/p) Σ_{k=0}^{p−1} δ_{(x_{n+k})_{n∈N_0}}.

Then P is invariant under the shift θ : S^{N_0} → S^{N_0} which maps (x_n)_{n∈N_0} to (x_{n+1})_{n∈N_0}.

(d) Consider the unit circle S^1 = {x ∈ C : |x| = 1}, which can be reparametrised as S^1 = {e^{iα} : α ∈ [0, 2π)}. For an angle β ∈ R we can define the rotation by the angle β via

θ_β : e^{iα} ↦ e^{i(α+β)}.

Show that this defines a measure-preserving transformation on the space (S^1, B(S^1), µ), with µ denoting the uniform measure on (S^1, B(S^1)) normalised to µ(S^1) = 1.

(e) Consider the following map θ from ([0, 1], B([0, 1]), λ|_{[0,1]}) onto itself:

θ(x) := 2x for x ∈ [0, 1/2),   θ(x) := 2 − 2x for x ∈ [1/2, 1].

The map is called the 'baker's map' since, if you visualise how the map acts on [0, 1], it is slightly reminiscent of kneading dough.

Show that the map is measure-preserving.
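A quick numeric sanity check of part (e) (not a proof; illustrative code only): if θ preserves λ|_{[0,1]}, then the image of a uniform sample under θ must again look uniform.

    import numpy as np

    rng = np.random.default_rng(3)

    u = rng.random(1_000_000)                 # sample from lambda on [0,1]
    tu = np.where(u < 0.5, 2 * u, 2 - 2 * u)  # apply the baker's map

    counts, _ = np.histogram(tu, bins=20, range=(0.0, 1.0))
    print(counts / len(tu))                   # every bin ~ 0.05 (uniform)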

The following definition will most probably not sound very useful until we get to know Theorem 4.1.1 below.


Definition 4.0.7. Let θ be a measure-preserving transformation on (Ω, F, µ). The σ-algebra

I := {F ∈ F : θ^{−1}(F) = F}

is called the σ-algebra of θ-invariant sets. A measure-preserving transformation on (Ω, F, µ) is called ergodic ('ergodisch') if for all F ∈ I, either µ(F) = 0 or µ(Ω\F) = 0.

Remark 4.0.8. It is immediate that if µ is a probability measure, then the transformation is ergodic if and only if µ(F) ∈ {0, 1} for all F ∈ I.

Exercise 4.0.9. (a) Check that I of Definition 4.0.7 indeed defines a σ-algebra.

(b) Let n ∈ N with n ≥ 2. Show that the ergodicity of a transformation θ does not imply the ergodicity of θ^n. Does the ergodicity of θ follow from the ergodicity of θ^n? Find proofs or counterexamples!

Since we're studying probability theory, the following notion is sufficiently important to deserve its own definition.

Definition 4.0.10. In the setting of Example 4.0.6 (a), if the underlying stochastic process (X_t) is stationary (i.e., the shift operator θ is measure-preserving on the given probability space), then the process is called ergodic if and only if the shift operator θ is ergodic.

Remark 4.0.11. According to the previous definition, a stochastic process in particular has to be stationary in order for it to possibly be ergodic. As mentioned in Example 4.0.4, if you start a Markov chain in equilibrium, this in particular supplies us with a stationary chain, i.e., the time shift is measure-preserving. Especially in the context of Markov chains, however, you will encounter chains (X_n) called 'ergodic' which do not even satisfy P ∘ X_n^{−1} = P ∘ X_{n+m}^{−1} for all m, n ∈ N_0, i.e., which do not even have the same distributions at different times.

Nevertheless, these chains are sometimes called ergodic if the implication of the (say, individual) ergodic theorem still holds true: for any initial distribution of the chain, P-a.s.,

(1/n) Σ_{i=0}^{n−1} f(X_i) → Σ_{x∈S} π(x) f(x)

for any (bounded) function f : S → R, with π denoting the unique stationary distribution of the chain (see e.g. Theorem 4.7 in [LPW09]). This means that even though the chain is neither stationary nor ergodic, we still have the mantra that spatial averages equal asymptotic time averages; this holds true e.g. for irreducible and aperiodic Markov chains on a finite state space.

Exercise 4.0.12. Determine which of the measure-preserving shifts defined in Example 4.0.6 (b)–(e) are ergodic (for some of them that is not completely easy, and you might want to take advantage of Lemma 4.0.15 below to show it).

In order to understand the above notions a bit better, we give the following auxiliary result.

Lemma 4.0.13. Let (Ω, F, P) be a probability space, and let θ be a measure-preserving transformation on Ω.

(a) A real F−B-measurable function f : (Ω, F, P) → R is I−B-measurable if and only if f ∘ θ = f.

(b) The transformation θ is ergodic if and only if every I−B-measurable function f : (Ω, F, P) → R is P-a.s. constant.


Proof. (a) Left as an exercise.

(b) Let θ be ergodic, and let f be an I−B-measurable function. Then for each s ∈ R we have that

P(f^{−1}((−∞, s))) ∈ {0, 1},

and since the left-hand side is monotone increasing in s with lim_{s↓−∞} P(f^{−1}((−∞, s))) = 0 and lim_{s↑∞} P(f^{−1}((−∞, s))) = 1, we get the P-a.s. equality

f = sup{s ∈ R : P(f^{−1}((−∞, s))) = 0} ∈ R.

Conversely, if every I−B-measurable function is P-a.s. constant, then in particular, for any F ∈ I, the function 1_F is P-a.s. constant, which implies P(F) ∈ {0, 1}. Since F ∈ I was chosen arbitrarily, this completes the proof.

We can introduce another pretty general and sufficient criterion for a shift to be ergodic. While there are in fact different notions of 'mixing' around, we will just introduce the following.

Definition 4.0.14. Let θ be a measure-preserving transformation on (Ω, F, P). Then θ is called mixing if for all F, G ∈ F we have

P(F ∩ θ^{−n}(G)) → P(F) P(G) as n → ∞.   (4.0.1)

Lemma 4.0.15. Let θ be a measure-preserving transformation on (Ω, F, P). If θ is mixing, then it is also ergodic.

Proof. Let F ∈ I, with I the σ-algebra of θ-invariant elements of F. We have

P(F) = P(F ∩ θ^{−n}(F)) → P(F)²,

so P(F) = P(F)², which implies P(F) ∈ {0, 1}; hence θ is ergodic with respect to P.

Example 4.0.16. (a) In the same way that we transferred the property of ergodicity from transformations to processes (cf. Example 4.0.6 (a)), we can do so for the property of mixing. Show that the moving average process from Example 4.0.4 (c) is mixing (hint: show that (4.0.1) holds true for F, G cylinder events, and then reason that this is sufficient to deduce the mixing property).

4.1 Ergodic theorems

See [Var01, Dud02, Str11] for more analytically flavoured approaches to the following ergodic theorems.

Theorem 4.1.1 (Individual ergodic theorem, Birkhoff's ergodic theorem). Let X ∈ L¹(Ω, F, P), and let θ be a measure-preserving transformation on (Ω, F, P). Then P-a.s.,

(1/n) Σ_{k=0}^{n−1} X ∘ θ^k → E[X | I].   (4.1.1)

Remark 4.1.2. (a) The theorem is oftentimes applied to random variables f ∘ X, where X is a random variable and f a real-valued function such that f ∘ X satisfies the assumptions of the theorem (consider X ∈ L² and f(x) := x² as an example). In this case (4.1.1) becomes

(1/n) Σ_{k=0}^{n−1} f(X ∘ θ^k) → E[f(X) | I].


(b) There is a version of Theorem 4.1.1 for σ-finite measure spaces as well (see e.g. [Dud02] or [Str11]), but we'll stick to the setting of probability spaces for simplicity.

(c) While a general intuition for the law of large numbers had been around for some time even before a rigorous proof of the result had been found, ergodicity is a bit harder to grasp. Indeed, it arguably first appeared in the incipient works on statistical mechanics by Boltzmann, notably [Bol68]. The guiding interpretation has been that the 'probability' of finding a particle in a certain part of the state space^2 should be given by the normalised long-time local time – i.e., the relative amount of time a particle has spent in that area in a long-term observation. Hence, ergodicity is oftentimes also phrased as 'time average equals ensemble average (i.e., expectation)'. Mathematically, ergodic theory only developed from the beginning of the 20th century onwards.

(d) Theorem 4.1.1 can be used to give yet another proof of recurrence of (simple) random walk (in d = 1) – since we have seen a variety of different proofs of this fact already, we refer to [Kle14, Section 20.4] for further details.

(e) It is immediate that, combining Example 4.0.6 (a) with Theorem 4.1.1, we obtain our third proof of the law of large numbers after Theorem 3.6.9 in [Dre19] and Exercise 2.4.5. The version we can derive here, like Exercise 2.4.5, only needs first moments of X_n.
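To see the theorem 'in action', the following sketch (illustrative code of ours) treats the circle rotation of Example 4.0.6 (d), written additively as θ(a) = a + β mod 1 on [0, 1). For irrational β the rotation is ergodic (cf. Exercise 4.0.12), so by Remark 4.1.6 below the time averages converge to the plain expectation under the uniform measure.

    import numpy as np

    beta = np.sqrt(2.0)                    # an irrational rotation 'angle'
    f = lambda a: np.cos(2 * np.pi * a) ** 2

    a0, n = 0.1, 1_000_000                 # arbitrary starting point
    orbit = (a0 + beta * np.arange(n)) % 1.0
    print(np.mean(f(orbit)))               # ~ 0.5 = E[f] under uniform measure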

Most proofs of this result proceed via the so-called maximal ergodic lemma, which is the following result.

Lemma 4.1.3 (Hopf's maximal ergodic lemma). Let X ∈ L¹(Ω, F, P), and let θ be a measure-preserving transformation on (Ω, F, P). Writing

S_n := Σ_{i=0}^{n−1} X ∘ θ^i for n ≥ 1,   S_0 := 0,   M_n := max_{i=0,...,n} S_i,

and introducing the set P_n := {M_n > 0}, we have

E[X 1_{P_n}] ≥ 0.

Proof. We start with observing that

S_{j+1} = X + S_j ∘ θ,   j ≥ 1,

hence for j ∈ {1, . . . , n} we have

S_j = X + S_{j−1} ∘ θ ≤ X + M_n ∘ θ.   (4.1.2)

Since furthermore on P_n we have M_n = max_{j=1,...,n} S_j, we infer, taking the maximum over j ∈ {1, . . . , n} in (4.1.2), that

M_n ≤ X + M_n ∘ θ on P_n.

^2 Indeed, as we will also be investigating in the construction of Brownian motion below, there is a priori nothing intrinsically random in the motion of a particle, say, a molecule – however, a computation of its trajectory on the basis of the determining parameters is oftentimes too complex.


This implies the first inequality in the following chain of (in-)equalities:

∫_{P_n} X dP ≥ ∫_{P_n} (M_n − M_n ∘ θ) dP = ∫_Ω M_n dP − ∫_{P_n} M_n ∘ θ dP ≥ ∫_Ω M_n dP − ∫_Ω M_n ∘ θ dP = 0,

where the last inequality follows since M_n ∘ θ ≥ 0, and the last equality follows since θ is measure-preserving.

We can now prove the ergodic theorem, following what is nowadays the standard proof.

Proof of Theorem 4.1.1. The first step is to observe that we can replace X by X − E[X | I], and hence assume w.l.o.g. that

E[X | I] = 0;   (4.1.3)

so it boils down to showing

(1/n) Σ_{k=0}^{n−1} X ∘ θ^k → 0 P-a.s.;   (4.1.4)

for brevity we introduce X_n := X ∘ θ^n and S_n := Σ_{k=0}^{n−1} X_k. We want to show that for each ε > 0,

P(L_ε) = 0, where L_ε := { lim sup_{n→∞} S_n/n > ε };   (4.1.5)

replacing X_k by −X_k we would then obtain (4.1.4).

and replacing Xk by −Xk we would obtain (4.1.4).Introduce the random variables

Xεk := (Xk − ε)1Lε , k ∈ N0

as well as

Sεn :=n−1∑k=0

Xεk.

Observing that Lε = θ−1(Lε), we infer that Lε ∈ I, and as a consequence, (Xεk) is a stationary

process.Using the notation M ε

n := maxni=0 Sεi , with Sε0 := 0 we apply the maximal ergodic lemma

(Lemma 4.1.3) to obtain0 ≤ E[Xε

01Mεn>0],

for each n ∈ N. Using the dominated convergence theorem, the right-hand side then convergesto E[Xε

01supn∈N Sεn>0].

But now

{ sup_{n∈N} S^ε_n > 0 } = { sup_{n∈N} S_n/n > ε } ∩ L_ε = L_ε,

and hence we infer with the above that

0 ≤ E[(X_0 − ε) · 1_{L_ε}] = E[1_{L_ε} E[X | I]] − ε P(L_ε) = −ε P(L_ε),

where E[X | I] = 0 by (4.1.3). This entails P(L_ε) = 0, and therefore (4.1.5) and (4.1.4) follow.

Theorem 4.1.4 (Mean ergodic theorem). Let p ∈ [1, ∞), X ∈ L^p(Ω, F, P), and let θ be a measure-preserving shift on (Ω, F, P). Then

(1/n) Σ_{k=0}^{n−1} X ∘ θ^k → E[X | I] in L^p(Ω, F, P).   (4.1.6)


Birkhoff's individual ergodic theorem supplies us with the P-a.s. convergence. Therefore, recalling our results for uniformly integrable families of random variables, we infer that it is sufficient to show that the family of random variables occurring on the left-hand side of (4.1.6), raised to their p-th powers, is uniformly integrable.

Lemma 4.1.5. Let p ∈ [1, ∞), and let (X_n) be a sequence of identically distributed but not necessarily independent real random variables with X_n ∈ L^p. Then, setting

Y_n := | (1/n) Σ_{k=0}^{n−1} X_k |^p,

the sequence (Y_n), n ∈ N_0, is uniformly integrable.

Proof. By Theorem 2.3.6 there exists a non-decreasing convex ϕ : [0, ∞) → [0, ∞) with ϕ(x)/x → ∞ as x → ∞ and such that

C_0 := E[ϕ(|X_0|^p)] < ∞.   (4.1.7)

We then obtain, by applying Jensen's inequality twice, that

ϕ(Y_n) ≤ ϕ( (1/n) Σ_{k=0}^{n−1} |X_k|^p ) ≤ (1/n) Σ_{k=0}^{n−1} ϕ(|X_k|^p);

indeed, for both inequalities we took the average (1/n) Σ_{k=0}^{n−1} . . . as expectation; for the first inequality we applied Jensen's inequality with the convex function x ↦ |x|^p, whereas for the second we used the convexity of ϕ. Since the X_k are identically distributed, in combination with (4.1.7) this implies that for each n ∈ N_0 we have

E[ϕ(Y_n)] ≤ C_0,

which, again due to Theorem 2.3.6, yields the desired uniform integrability.

Proof of Theorem 4.1.4. We take advantage of Lemma 4.1.5 in order to deduce that, since θ is measure-preserving and thus all the X ∘ θ^k, k ∈ N_0, are identically distributed, the family of random variables defined via

Z_n := | (1/n) Σ_{k=0}^{n−1} X ∘ θ^k − E[X | I] |^p

is uniformly integrable. Since Birkhoff's ergodic theorem implies that Z_n → 0 holds P-a.s. as n → ∞, Theorem 2.3.9 implies that Z_n → 0 in L¹ as well, which entails the desired L^p-convergence.

Remark 4.1.6. In the important case of an ergodic transformation θ, Lemma 4.0.13 (b) implies that E[X] = E[X | I]. Therefore, in the settings of Theorems 4.1.1 and 4.1.4 we have convergence of

(1/n) Σ_{i=0}^{n−1} X ∘ θ^i

to E[X], P-a.s. and in L^p, respectively.

The following example gives a nice connection between probability theory, number theory, and applications.

Example 4.1.7. Benford's law has even been used, e.g., in the context of detecting tax evasion; see http://search.proquest.com/docview/211023799?pq-origsite=gscholar. We refer to [Kle14, p. 447] for a short version, and to [Ben38] as well as [BH+11] for longer bedtime stories.


4.1.1 Subadditive ergodic theorem

A substantial enhancement of Birkhoff's ergodic theorem given in the previous section is the subadditive ergodic theorem. Indeed, in many models inspired by real-world applications one does not have stationarity of the corresponding sequence of random variables, but only some weaker properties. Nevertheless, under suitable assumptions one can still derive a result in the spirit of the ergodic theorems we investigated before.

Definition 4.1.8. A sequence (x_n)_{n∈N} of real numbers is called subadditive if

x_{n+m} ≤ x_n + x_m   ∀ n, m ∈ N

(and it is called superadditive if (−x_n), n ∈ N, is subadditive).

Lemma 4.1.9 (Fekete's lemma). If (x_n)_{n∈N} is a subadditive sequence, then lim_{n→∞} x_n/n exists in R ∪ {−∞} and

lim_{n→∞} x_n/n = inf_{n∈N} x_n/n.   (4.1.8)

Proof. Once we have proven (4.1.8), the fact that the limit is an element of R ∪ {−∞} is immediate. It is clear that

lim inf_{n→∞} x_n/n ≥ inf_{n∈N} x_n/n

holds, so in order to show the converse inequality, let ε > 0 be arbitrary and let n_ε ∈ N be such that

x_{n_ε}/n_ε ≤ inf_{n∈N} x_n/n + ε.   (4.1.9)

For n > n_ε write n = k n_ε + r with r ∈ {0, . . . , n_ε} and k ∈ N. Then, using the subadditivity, we infer

x_n/n ≤ (k x_{n_ε} + x_r)/n = k x_{n_ε}/n + x_r/n,

and taking the lim sup we infer

lim sup_{n→∞} x_n/n ≤ x_{n_ε}/n_ε ≤ inf_{n∈N} x_n/n + ε,

where the second inequality is (4.1.9). Since ε > 0 was chosen arbitrarily, this proves the result.
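A toy illustration (our example): x_n = 2n + √n is subadditive, since √· is subadditive and the linear part is additive, and x_n/n decreases towards inf_n x_n/n = 2.

    import math

    x = lambda n: 2 * n + math.sqrt(n)

    # subadditivity on a grid, as a sanity check
    assert all(x(n + m) <= x(n) + x(m)
               for n in range(1, 60) for m in range(1, 60))

    for n in (10, 1_000, 100_000):
        print(n, x(n) / n)                 # -> 2 from above, as (4.1.8) predicts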

The following result is a stochastic generalisation of Fekete's lemma (Lemma 4.1.9); it is originally due to Kingman [Kin73] and has been improved to its current form by Liggett, see [Lig85]. In fact, the latter article is a very good read and entirely accessible at your level of probability theory.

Theorem 4.1.10 (Subadditive ergodic theorem). Let (X_{m,n}), n ∈ N_0, 0 ≤ m ≤ n, be an array of real random variables satisfying the following properties:

(a) X_{0,n} ∈ L¹ for all n ∈ N, and inf_{n∈N} E[X_{0,n}]/n > −∞;

(b) X_{0,n} ≤ X_{0,m} + X_{m,n} for all n ∈ N, m ∈ {1, . . . , n − 1};   (4.1.10)

(c) for each k ∈ N, the sequence (X_{kn, k(n+1)}), n ∈ N, forms a stationary process;

(d) for each k ∈ N_0, the distribution of the sequence (X_{k, k+n}), n ∈ N, in (R^N, B^{⊗N}) is the same as that of the sequence (X_{k+1, k+1+n}), n ∈ N, in (R^N, B^{⊗N}).

Then:

X_∞ := lim_{n→∞} X_{0,n}/n exists P-a.s. and in L¹,

with

E[X_∞] = lim_{n→∞} (1/n) E[X_{0,n}] = inf_{n∈N} (1/n) E[X_{0,n}].

If in addition the stationary processes in assumption (c) are ergodic (i.e., the canonical shift on the product space is ergodic), then X_∞ is P-a.s. constant and equal to E[X_∞].

Remark 4.1.11. (a) A first version of the above result had earlier been proven by Kingman [Kin73], and this is why the above theorem is often referred to as (an improved version / Liggett's version of) Kingman's subadditive ergodic theorem.

A crucial difference to the above result is that Kingman [Kin73] assumed

X_{l,n} ≤ X_{l,m} + X_{m,n}   ∀ 0 ≤ l ≤ m ≤ n,

instead of (4.1.10). This assumption is strictly stronger than (4.1.10).

(b) In the case of the corresponding transformation θ being ergodic, Theorem 4.1.10 is indeed an improvement over the ergodic theorems we have investigated in the previous section.

Assuming that the conditions of the (individual, or mean for p = 1) ergodic theorem are fulfilled, for m, n ∈ N_0 with m < n set

X_{m,n} := Σ_{i=m+1}^{n} X ∘ θ^i.

Then

X_{0,n} = X_{0,m} + X_{m,n}.

Therefore, the subadditive ergodic theorem supplies us with

X_∞ = lim_{n→∞} (1/n) Σ_{i=0}^{n−1} X ∘ θ^i P-a.s. and in L¹,

with X_∞ = E[X], which is the statement the previous ergodic theorems would supply us with, too.

Proof. The proof can be found in [Lig85] and, although not very short, is not overly complicated either. It takes advantage of Birkhoff's ergodic theorem itself, so we omit it here.

We now give an example of its application, essentially motivated by current research; it is a simplified version of some results of [DGRS12].

Example 4.1.12 (Existence of quenched Lyapunov exponents in the parabolic Anderson model). Let ξ := (ξ(x))_{x∈Z^d} be a real-valued random field which is ergodic with respect to the canonical shifts θ_z ξ := (ξ(x + z))_{x∈Z^d}, for all z ∈ Z^d.^3 For simplicity assume further that

^3 Again, in our lingo this means that with our canonical probability space (R^{Z^d}, B^{⊗Z^d}, P), with P the law of (ξ(x))_{x∈Z^d}, the canonical shifts θ_z : Z^d ∋ x ↦ x + z ∈ Z^d are measure-preserving and ergodic.


|ξ(0)| ≤ A P-a.s. for some A > 0. The solution of the parabolic Anderson model (PAM) is the solution of the following (random) parabolic difference equation with random potential ξ:^4

∂_t u(t, x) = ∆u(t, x) − γ ξ(x) u(t, x),   u(0, x) = 1,   x ∈ Z^d, t ≥ 0,   (4.1.11)

where ξ is as before, ∆f(x) = (1/(2d)) Σ_{‖y−x‖=1} (f(y) − f(x)) is the discrete Laplacian on Z^d that we have bumped into before, and γ > 0 is a constant. By the so-called Feynman–Kac formula, the solution u admits the probabilistic representation

u(t, 0) = E^X_0[ exp{ −γ ∫_0^t ξ(X_s) ds } ],

where (X_s)_{s∈[0,∞)} is a continuous-time simple random walk with jump rate 1, independent of the potential (ξ(z))_{z∈Z^d}.^5 For x ∈ Z^d let P^X_x and E^X_x denote, respectively, probability and expectation for a jump-rate-1 simple symmetric random walk X starting from x. For each t ≥ 0 and x, y ∈ Z^d, define

e(t, x, y, ξ) := E^X_x[ exp{ −γ ∫_0^t ξ(X_u) du } 1_{X_t=y} ],
a(t, x, y, ξ) := − log e(t, x, y, ξ).   (4.1.12)

The quantity a(t, x, y, ξ) is also referred to as the point-to-point passage function from x to y. We will prove parts of the following so-called shape theorem for a(t, 0, y, ξ).

Theorem 4.1.13 (Shape theorem). There exists a deterministic convex function α : R^d → R, which we call the shape function, such that P-a.s., for any compact K ⊂ R^d,

lim_{t→∞} sup_{y ∈ tK ∩ Z^d} | t^{−1} a(t, 0, y, ξ) − α(y/t) | = 0.   (4.1.13)

Remark 4.1.14. The term shape theorem is motivated by the fact that α + ess inf ξ(0) defines a norm on R^d (check!), and the theorem gives information about the shape of the balls {x ∈ R^d : α(x) + ess inf ξ(0) ≤ r}, r ∈ (0, ∞), in this norm.

More precisely, we will show the following weaker result.

Proposition 4.1.15. There exists a deterministic function α : Q^d → [−A, ∞) such that P-a.s., for every y ∈ Q^d,

lim_{t→∞, ty∈Z^d} t^{−1} a(t, 0, ty, ξ) = α(y).   (4.1.14)

Proof. Since we assume y ∈ Q^d and ty ∈ Z^d, without loss of generality it suffices to consider y ∈ Z^d and t ∈ N. Note that by the definition of the passage function a in (4.1.12), P-a.s.,

a(t_1 + t_2, x_1, x_3) ≤ a(t_1, x_1, x_2) + a(t_2, x_2, x_3)   ∀ t_1, t_2 > 0, x_1, x_2, x_3 ∈ Z^d.

Together with our assumption on the ergodicity of ξ, this implies that the two-parameter family a(t − s, sy, ty), 0 ≤ s ≤ t with s, t ∈ N_0, satisfies the conditions of the subadditive ergodic theorem (Theorem 4.1.10). Therefore, there exists a deterministic constant α(y) such that (4.1.14) holds.

^4 See [K16] for a survey of this model.
^5 So note that we have two kinds of randomness here, one for the random potential ξ and one for the random walk, and these two are supposed to be independent; one way to actually set this up is to construct both processes on different probability spaces and then take the product of these two probability spaces as our new probability space, which is big enough to accommodate all the randomness needed. Oftentimes we just want to take expectations with respect to one of the two processes. Formally, this corresponds to taking the conditional expectation; so taking the expectation with respect to the random walk starting in 0 technically boils down to taking the conditional expectation E[ · | (ξ(z))_{z∈Z^d}].


To extend the definition of α(y) in Theorem 4.1.13 to y ∉ Q^d and to prove the uniform convergence in (4.1.13), we need to establish the equicontinuity of t^{−1} a(t, 0, ty) in y as t → ∞. For more details, see [DGRS12].



Chapter 5

Brownian motion and random walks

For a profound study of Brownian motion we refer to the monograph [MP10], which is one of the standard sources on the topic, including relatively recent and advanced results.

An important fact in this section, which will be used without further mention, is that the distribution of a d-dimensional normally distributed random variable is already characterised by its expectation and its covariance matrix (cf. Examples 2.2.8 and 1.5.8 of [Dre19]).

5.1 Further constructions of Brownian motion

We recall the definition of Brownian motion in Definition 3.2.18.

While we have already seen in Theorem 3.2.20 that Brownian motion exists, we want to give some further possibilities of construction, in particular since in Theorem 3.2.20 we took advantage of the result by Kolmogorov and Chentsov (which we did not prove) in order to deduce the desired P-a.s. continuity of the sample paths.

As a warm-up, we start with a simple exercise.

Exercise 5.1.1. Let (B_t), t ∈ [0, ∞), be standard Brownian motion and let r ∈ [0, ∞). Then the process (B_{t+r} − B_r), t ∈ [0, ∞), is also a standard Brownian motion.

5.1.1 Levy’s construction of Brownian motion

This construction is due to P. Lévy [L92], the first edition of which had been published in 1948. We will follow the version of [MP10] here. There are other constructions of Brownian motion as a uniform limiting process (in the supremum norm for continuous functions on compact intervals) of continuous processes (e.g. using Haar functions, see Section 21.5 in [Kle14] or Section 2.5 in [KS91]), but the one we give is very intuitive and furthermore has nice consistency properties.

Theorem 5.1.2 (N. Wiener (1923)). Standard Brownian motion exists.

For the proof we will need a couple of lemmas that are interesting in their own right.

Lemma 5.1.3. For X ∼ N(0, 1) and x > 0 we have

(1/x − 1/x³) (1/√(2π)) e^{−x²/2} ≤ P(X ≥ x) ≤ (1/x) (1/√(2π)) e^{−x²/2}.

Proof. We have

√(2π) P(X ≥ x) = ∫_x^∞ e^{−s²/2} ds ≤ ∫_x^∞ (s/x) e^{−s²/2} ds = x^{−1} e^{−x²/2},

which is the upper bound.


The lower bound is slightly more involved; we compute, using the substitution s ↦ x + s/x as well as the fact that 1 − z < e^{−z} for z > 0,

√(2π) P(X ≥ x) = ∫_x^∞ e^{−s²/2} ds = x^{−1} ∫_0^∞ e^{−(x² + 2s + s²/x²)/2} ds
= x^{−1} e^{−x²/2} ∫_0^∞ e^{−(2s + s²/x²)/2} ds ≥ x^{−1} e^{−x²/2} ∫_0^∞ e^{−s} (1 − s²/(2x²)) ds,

and the latter integral equals 1 − x^{−2}, which finishes the proof.
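The bounds are easy to check numerically, using that P(X ≥ x) = erfc(x/√2)/2 for X ∼ N(0, 1) (a short illustrative script of ours):

    import math

    for x in (1.0, 2.0, 4.0):
        phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        lower = (1 / x - 1 / x**3) * phi          # lower bound of Lemma 5.1.3
        tail = 0.5 * math.erfc(x / math.sqrt(2))  # exact Gaussian tail
        upper = phi / x                           # upper bound of Lemma 5.1.3
        print(lower <= tail <= upper, lower, tail, upper)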

Lemma 5.1.4. Let X be a d-dimensional Gaussian random variable, i.e., distributed according to N(µ, Σ) with mean µ ∈ R^d and covariance matrix Σ ∈ R^{d×d}. Then the coordinates (X_i), i ∈ {1, . . . , d}, are independent random variables if and only if

Cov(X_i, X_j) = 0   ∀ i, j ∈ {1, . . . , d} with i ≠ j.   (5.1.1)

Proof. We have gotten to know d-dimensional normally distributed random variables in Examples 2.2.8 and 1.5.8 of [Dre19], for instance, which in particular showed that the distribution of a d-dimensional normally distributed random variable is completely determined by its mean and its covariance matrix.

Since independence of the (X_i) implies that they are uncorrelated, the interesting direction is the reverse; so assume (5.1.1) to hold. Then, if (Y_1, . . . , Y_d) denotes a random vector with i.i.d. N(0, 1)-distributed entries, we infer that the vector (µ_1 + √Var(X_1) Y_1, . . . , µ_d + √Var(X_d) Y_d) has the same distribution as the vector (X_1, . . . , X_d), which implies the independence of the (X_i), i ∈ {1, . . . , d}.

Exercise 5.1.5. Convince yourself that it is essential in the previous lemma to not only have that the X_i are all Gaussian, but that indeed the vector X is a Gaussian random vector. Indeed, consider the case d = 2, X_1 ∼ N(0, 1), and ε an independent Rademacher-distributed random variable. Set X_2 := ε · X_1. Check that X_1 and X_2 are both N(0, 1)-distributed random variables, that they are uncorrelated, but not independent.
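A short simulation of this counterexample (illustrative code only):

    import numpy as np

    rng = np.random.default_rng(4)

    x1 = rng.standard_normal(1_000_000)
    eps = rng.choice([-1.0, 1.0], size=x1.size)       # independent Rademacher sign
    x2 = eps * x1                                     # again N(0,1)-distributed

    print(np.mean(x1 * x2))                           # ~ 0: uncorrelated
    print(np.corrcoef(np.abs(x1), np.abs(x2))[0, 1])  # = 1: clearly dependent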

Lemma 5.1.6. Let X and Y be independent real random variables, both N(0, σ²)-distributed (σ² ∈ (0, ∞)). Then the random variables X + Y and X − Y are independent and N(0, 2σ²)-distributed.

Proof. It is obvious from Example 3.3.11 (a) in [Dre19] that X + Y and X − Y are N(0, 2σ²)-distributed, and (X + Y, X − Y) is a Gaussian random vector since (X + Y, X − Y)ᵗ = M (X, Y)ᵗ with

M := ( 1   1 ; 1  −1 );

you may convince yourself that this means that the vector (X + Y, X − Y) is Gaussian with mean 0 and covariance matrix σ² M Mᵗ. But we have

E[(X + Y)(X − Y)] = E[X²] − E[Y²] = 0,

which means Cov(X + Y, X − Y) = 0, and so Lemma 5.1.4 implies the desired independence.


Proof of Theorem 5.1.2. We first note that it is sufficient to construct standard Brownian motion (B_t) on the interval [0, 1] (i.e., a stochastic process satisfying all the properties of Brownian motion, but with [0, ∞) replaced by [0, 1]), since we can then extend it to [0, ∞) by 'patching together' a countable number of independent Brownian motions. Indeed, denoting by (B^{(n)}_t), t ∈ [0, 1], n ∈ N, an independent family of such processes, we can define

B_t := Σ_{i=1}^{⌊t⌋} B^{(i)}_1 + B^{(⌊t⌋+1)}_{t−⌊t⌋},   t ∈ [0, ∞),

and check that this satisfies the properties of Brownian motion.
Now, in order to define Brownian motion (B_t), t ∈ [0, 1], the idea is to first construct it on the dyadic numbers. For this purpose, set

D_n := {k 2^{−n} : k ∈ {0, . . . , 2^n}},   n ∈ N,

as well as D := ∪_{n∈N} D_n. Let a family of i.i.d. random variables (X_t), t ∈ D, be given such that X_t ∼ N(0, 1). To start with, we want to construct a process (B_t), t ∈ D, such that the following properties are fulfilled for each n ∈ N:

(a) For r < s < t ∈ D_n the random variables B_t − B_s and B_s − B_r are independent, and the former is N(0, t − s)-distributed whereas the latter is N(0, s − r)-distributed;

(b) the family of random variables B_t, t ∈ D_n, is independent of the family X_t, t ∈ D\D_n.

Define

B_0 := 0 and B_1 := X_1,   (5.1.2)

and inductively, given B_t for all t ∈ D_{n−1}, n ∈ N, we define for t ∈ D_n\D_{n−1}

B_t := (B_{t−2^{−n}} + B_{t+2^{−n}})/2 + 2^{−(n+1)/2} X_t = (B_{t+2^{−n}} − B_{t−2^{−n}})/2 + B_{t−2^{−n}} + 2^{−(n+1)/2} X_t,   (5.1.3)

where we note that (B_{t+2^{−n}} − B_{t−2^{−n}})/2 as well as 2^{−(n+1)/2} X_t are both N(0, 2^{−(n+1)})-distributed. While, as alluded to before, the first summand is a linear interpolation between the values of the process at the neighbouring dyadic sites of the next largest scale, the second is an independent normal perturbation of exactly the right scale.
We now prove inductively that the process (B_t), t ∈ D, constructed above fulfills the desired properties. Indeed, for n = 0 it is obvious from (5.1.2) that (b) and (a) hold true. Now assume that (b) and (a) hold for all numbers in {0, 1, . . . , n}, some n ∈ N_0. Then we infer, in combination with (b) and the independence assumption on the (X_t), t ∈ D, that the families

(B_t), t ∈ D_n,   (X_t), t ∈ D_{n+1}\D_n,   (X_t), t ∈ D\D_{n+1},

are all independent, and therefore from the definition of B_t for t ∈ D_{n+1} we infer that (b) holds for n + 1 as well.
In order to establish (a), observe that due to the recursive definition (5.1.3), it is sufficient to show that the family of all increments (B_{t+2^{−(n+1)}} − B_t), t ∈ D_{n+1}\{1}, is an independent family, since they are all N(0, 2^{−(n+1)})-distributed. Since the vector formed by these increments is a Gaussian random vector (check!), it is in fact sufficient, due to Lemma 5.1.4, to show that they are pairwise uncorrelated. For this purpose, we distinguish two cases:


(a) If t − s = 2^{−(n+1)}, then s and t are consecutive points of D_{n+1} and share a neighbour in D_n; say t ∈ D_{n+1}\D_n, so that s = t − 2^{−(n+1)} (the other case is treated symmetrically). From (5.1.3), applied at level n + 1, we infer that

B_{t+2^{−(n+1)}} − B_t = (B_{t+2^{−(n+1)}} − B_{t−2^{−(n+1)}})/2 − 2^{−(n+2)/2} X_t,

as well as

B_{s+2^{−(n+1)}} − B_s = B_t − B_{t−2^{−(n+1)}} = (B_{t+2^{−(n+1)}} − B_{t−2^{−(n+1)}})/2 + 2^{−(n+2)/2} X_t.

Now note that in both displays, the two summands on the right-hand side are independent and both N(0, 2^{−(n+2)})-distributed. With Lemma 5.1.6 we therefore infer that B_{t+2^{−(n+1)}} − B_t and B_{s+2^{−(n+1)}} − B_s are independent.

(b) If t − s > 2^{−(n+1)}, let m ∈ N be minimal such that there exists d ∈ D_m with the interval inclusions

[s, s + 2^{−(n+1)}] ⊂ [d − 2^{−m}, d] and [t, t + 2^{−(n+1)}] ⊂ [d, d + 2^{−m}]

(convince yourself that such a d exists). Note that

m < n + 1.   (5.1.4)

Since the increments over intervals of length 2^{−(n+1)} which are contained in [d − 2^{−m}, d] are constructed using only B_d − B_{d−2^{−m}} as well as X_r, r ∈ [d − 2^{−m}, d], and similarly for [d, d + 2^{−m}], we infer the desired independence.

This shows that (a) and (b) hold true for all n ∈ N. We now define our limiting process as follows. Start with introducing the functions

F_0(t) := X_1, if t = 1,
F_0(t) := 0, if t = 0,
F_0(t) := linear interpolation between the values at 0 and 1, otherwise,

as well as

F_n(t) := 2^{−(n+1)/2} X_t, if t ∈ D_n\D_{n−1},
F_n(t) := 0, if t ∈ D_{n−1},
F_n(t) := linear interpolation between the values at consecutive points of D_n, otherwise,

for n ≥ 1, and note in passing that these functions are continuous on [0, 1]. Then for t ∈ D_n it is not hard to see that we have

B_t = Σ_{i=0}^{n} F_i(t) = Σ_{i=0}^{∞} F_i(t).   (5.1.5)

Our goal now is to show that the infinite sum converges uniformly on [0, 1], and hence that the limit is again a continuous function. For this purpose we notice that, as a consequence of Lemma 5.1.3, for any C ∈ (0, ∞) fixed, we have

P(|X_t| ≥ C√n) ≤ e^{−C²n/2}   (5.1.6)

for all n ≥ N, for some N ∈ N. Hence, we infer

Σ_{n∈N_0} P(∃ t ∈ D_n such that |X_t| ≥ C√n) ≤ Σ_{n≤N} (2^n + 1) + Σ_{n>N} (2^n + 1) e^{−C²n/2}.


Now for C > √(2 ln 2) the series on the right-hand side converges, and therefore the Borel–Cantelli lemma implies that there exists a random but finite N such that for all n ≥ N,

|X_t| ≤ C√n for all t ∈ D_n,

and so

‖F_n‖_∞ ≤ C√n · 2^{−n/2}.

This entails that P-a.s., the infinite series on the right-hand side of (5.1.5) converges uniformly on [0, 1], and the continuous limit on [0, 1] is what we define (B_t), t ∈ [0, 1], to be.
It remains to check that the increments of (B_t), t ∈ [0, 1], have the required finite-dimensional distributions. For this purpose, let 0 ≤ t_1 < . . . < t_n ≤ 1, and for each i ∈ {1, . . . , n} let (t^{(i)}_k), k ∈ N, be a D-valued sequence such that t^{(i)}_k → t_i as k → ∞. Then, for all k large enough, B_{t^{(i+1)}_k} − B_{t^{(i)}_k} ∼ N(0, t^{(i+1)}_k − t^{(i)}_k), and these increments form an independent family in i ∈ {1, . . . , n − 1}. Furthermore, P-a.s., as k → ∞,

B_{t^{(i+1)}_k} − B_{t^{(i)}_k} → B_{t_{i+1}} − B_{t_i}.

Using e.g. characteristic functions and Theorem 2.6.3 of [Dre19], it is not hard to show that B_{t_{i+1}} − B_{t_i} ∼ N(0, t_{i+1} − t_i), and that the increments B_{t_{i+1}} − B_{t_i}, i ∈ {1, . . . , n − 1}, form an independent family, which finishes the proof.
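The construction just carried out is completely algorithmic. The following compact sketch (hypothetical code of ours; names and parameters are arbitrary) implements the dyadic refinement (5.1.3) up to a finite level and checks that the resulting increments have variance equal to the mesh size.

    import numpy as np

    rng = np.random.default_rng(5)

    def levy_bm(levels):
        # values on D_0 = {0, 1}: B_0 = 0, B_1 ~ N(0,1), cf. (5.1.2)
        B = np.array([0.0, rng.standard_normal()])
        for n in range(1, levels + 1):
            # midpoint average plus an independent N(0, 2^{-(n+1)}) kick, cf. (5.1.3)
            mid = (B[:-1] + B[1:]) / 2
            mid += 2 ** (-(n + 1) / 2) * rng.standard_normal(mid.size)
            out = np.empty(2 * B.size - 1)
            out[0::2], out[1::2] = B, mid      # interleave old and new points
            B = out
        return B                               # B[k] approximates B_{k 2^{-levels}}

    B = levy_bm(levels=12)                     # 2^12 + 1 grid points
    incr = np.diff(B)
    print(B.size, incr.var() * 2**12)          # second value ~ 1.0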

5.2 Some properties of Brownian motion

Proposition 5.2.1 (Scaling invariance). If c > 0 and (B_t) is a Brownian motion starting in x ∈ R^d, then the process ((1/c) B_{c²t}) is a Brownian motion starting in x/c.

Proof. The only property of Definition 3.2.18 that is not clear is part (c). But by assumption we have B_{c²t} − B_{c²s} ∼ N(0, c²(t − s) Id), and hence

(1/c) B_{c²t} − (1/c) B_{c²s} ∼ N(0, (t − s) Id),

which finishes the proof.

Remark 5.2.2. The above result gives rise to the term diffusive scaling, which means 'speeding up' time by a factor n² and 'shrinking' space via division by n, and then taking n → ∞. This is the same scaling that you have seen in the central limit theorem (where the only notational difference was that time was of order n and space was rescaled by n^{−1/2}, which is the same in the sense that time is rescaled by the inverse square of the rescaling of space).

Proposition 5.2.3 (Time inversion). If (B_t), t ∈ [0, ∞), is Brownian motion, then so is (t B_{1/t}), t ∈ [0, ∞), with the value at t = 0 defined to be 0.

Proof. Recall that the finite-dimensional distributions (B_{t_1}, . . . , B_{t_n}) of Brownian motion are Gaussian random vectors and are therefore characterised by E[B_{t_i}] = 0 and Cov(B_{t_i}, B_{t_j}) = t_i for 0 ≤ t_i ≤ t_j. Define now X_t := t B_{1/t} with X_0 = 0. Trivially, this process is also Gaussian and the corresponding Gaussian vectors have expectation zero. It remains to check what the covariances are. For t > 0, h ≥ 0, they are given by

Cov(X_{t+h}, X_t) = (t + h) t Cov(B_{1/(t+h)}, B_{1/t}) = t(t + h) · 1/(t + h) = t.

This confirms that the finite-dimensional distributions of (X_t) are the same as for Brownian motion. The continuity of the paths t ↦ X_t is clear for t > 0. For t = 0 we use the


following two facts. First, since Q is countable, the distribution of {X_t : t ≥ 0, t ∈ Q} is the same as for a Brownian motion, and hence

lim_{t→0, t∈Q} X_t = 0 almost surely.

Secondly, Q ∩ (0, ∞) is dense in (0, ∞) and {X_t : t ≥ 0} is almost surely continuous on (0, ∞), so that

0 = lim_{t→0, t∈Q} X_t = lim_{t→0} X_t almost surely.

This proves the continuity of the paths and yields the stated claim.

While so far we've mainly been collecting regularity properties of Brownian motion, we now turn to a property which explains why I would fail miserably at sketching Brownian motion at the blackboard.

Proposition 5.2.4. Let (B_t) be Brownian motion, and let [a, b] ⊂ [0, ∞), a < b, be an interval. Then P-a.s., (B_t) is not monotone on the interval [a, b].

Proof. We give the proof for 'increasing'; the case of 'decreasing' follows by considering (−B_t), t ∈ [0, ∞), which also constitutes a Brownian motion.

For arbitrary n ∈ N we partition [a, b] into the subintervals

[a_{n,k}, b_{n,k}),   k ∈ {0, . . . , n − 1},

where a_{n,k} := a + (k/n)(b − a) and b_{n,k} := a_{n,k} + (1/n)(b − a). Denoting by M_I the event that Brownian motion is increasing on the interval I, we have for each n ∈ N that

M_{[a,b]} = ∩_{k=0}^{n−1} M_{[a_{n,k}, b_{n,k})}.

Furthermore, since increments of Brownian motion are independent, the events in the intersection on the right-hand side are independent, and we also have

P(M_{[a_{n,k}, b_{n,k})}) ≤ P(B_{b_{n,k}} − B_{a_{n,k}} ≥ 0) = 1/2,

since on any such interval the increments are centred Gaussian. All in all, this supplies us with

P(M_{[a,b]}) = Π_{k=0}^{n−1} P(M_{[a_{n,k}, b_{n,k})}) ≤ 2^{−n}   ∀ n ∈ N,

so P(M_{[a,b]}) = 0, which is what we wanted to prove.

5.2.1 Law of the iterated logarithm

The law of the iterated logarithm is in some sense a regime in between the law of large numbers and the central limit theorem. In fact, while the (strong version of the) former implies S_n/n → 0 almost surely under suitable circumstances, the latter states the convergence in distribution of S_n/√n to a normal random variable.

In fact, for arbitrary C > 0 we have, in combination with Corollary 2.1.6 of [Dre19], that

P( lim sup_{n→∞} {B_n > C√n} ) ≥ lim sup_{n→∞} P(B_n > C√n) = P(B_1 > C) ∈ (0, 1),

so Kolmogorov's 0-1 law implies that almost surely,

lim sup_{n→∞} B_n/√n = ∞.

On the other hand, we have already seen in Remark 3.8.2 that the sequence cannot converge in probability either. The law of the iterated logarithm gives the right scaling (including the constant) for the lim sup to be finite and non-trivial.

Theorem 5.2.5 (Law of the iterated logarithm). Let (B_t) be one-dimensional Brownian motion starting in x ∈ R. Then P-a.s.,

lim sup_{t→∞} B_t / √(2t ln ln t) = 1.   (5.2.1)

Proof. Since Brownian motion starting in x can be obtained by shifting standard Brownian motion by x, and since the influence of the starting point vanishes in (5.2.1), it is sufficient to prove the result for standard Brownian motion.
We start with showing that P-a.s.,

lim sup_{t→∞} B_t / √(2t ln ln t) ≤ 1   (5.2.2)

by chopping time into exponentially growing intervals. More precisely, for α > 1 we set t_n := α^n as well as f(t) := 2α² ln ln t. Since t ↦ t f(t) is increasing, we observe

P( max_{t_n ≤ s ≤ t_{n+1}} B_s / √(2s ln ln s) > α ) ≤ P( max_{t_n ≤ s ≤ t_{n+1}} B_s > √(t_n f(t_n)) ).   (5.2.3)

Then, using scale invariance of Brownian motion in the first equality, the reflection principle (Theorem 5.2.12) and Lemma 5.1.3 to obtain the second inequality, and the identity f(t_n)/(2α) = α ln(n ln α) = α ln n + α ln ln α in the penultimate step, we get

P( max_{t_n ≤ s ≤ t_{n+1}} B_s > √(t_n f(t_n)) ) ≤ P( (1/√t_{n+1}) max_{0 ≤ s ≤ t_{n+1}} B_s > √(f(t_n)/α) )
= P( max_{0 ≤ s ≤ 1} B_s > √(f(t_n)/α) ) ≤ (2/√(2π)) √(α/f(t_n)) e^{−f(t_n)/(2α)} ≤ √(α/f(t_n)) (ln α)^{−α} n^{−α} ≤ n^{−α},

where the latter inequality holds true for all n large enough. Now since α > 1, the right-hand side is summable in n, and we obtain, using the Borel–Cantelli Lemma 1.13.6, that almost surely,

lim sup_{t→∞} B_t / √(t f(t)) < α.

Taking α ↓ 1 we recover (5.2.2). It remains to show the lower bound

lim sup_{t→∞} B_t / √(2t ln ln t) ≥ 1.   (5.2.4)

For this purpose, we will take α → ∞ in the above scales. Define, with α and t_n as above, the quantities ᾱ := (α − 1)/α < 1 and g(t) := 2ᾱ² ln ln t. Using that t_{n+1} − t_n = α^n(α − 1) = t_{n+1} ᾱ, and recalling Lemma 5.1.3 (taking advantage of the fact that 1/x − 1/x³ ≥ 1/(2x) for x ≥ √2 in that bound), we obtain

P( B_{t_{n+1}} − B_{t_n} > √(t_{n+1} g(t_{n+1})) ) = P( (B_{t_{n+1}} − B_{t_n}) / √(t_{n+1} − t_n) > √( t_{n+1} g(t_{n+1}) / (t_{n+1} − t_n) ) )
= P( B_1 > √(g(t_{n+1})/ᾱ) )
≥ (1/(2√(2π))) √(ᾱ/g(t_{n+1})) e^{−g(t_{n+1})/(2ᾱ)} = (1/(2√(2π))) √(ᾱ/g(t_{n+1})) (ln α)^{−ᾱ} (n + 1)^{−ᾱ}.


Since ᾱ < 1, the right-hand side summed over n yields ∞, so, using the independence of the respective events, the Borel–Cantelli lemma supplies us with

P( B_{t_{n+1}} − B_{t_n} > √(t_{n+1} g(t_{n+1})) for infinitely many n ) = 1,   (5.2.5)

and we abbreviate G_n := { B_{t_{n+1}} − B_{t_n} > √(t_{n+1} g(t_{n+1})) }.

Now using the upper bound (5.2.2) in combination with the symmetry of Brownian motion and the fact that

lim_{n→∞} (t_{n+1} ln ln t_{n+1}) / (t_n ln ln t_n) = α,

we infer that for ε > 0, P-a.s.,

B_{t_n} > −((1 + ε)/√α) √(2 t_{n+1} ln ln t_{n+1})

for all n large enough. Using this in combination with (5.2.5), we infer that P-a.s., for infinitely many n, we have on G_n that

B_{t_{n+1}} = (B_{t_{n+1}} − B_{t_n}) + B_{t_n} ≥ √(t_{n+1} g(t_{n+1})) − ((1 + ε)/√α) √(2 t_{n+1} ln ln t_{n+1}),

and hence

lim sup_{n→∞} B_{t_n} / √(2 t_n ln ln t_n) ≥ ᾱ − (1 + ε)/√α = (α − 1)/α − (1 + ε)/√α.

Taking α → ∞ we recover (5.2.4), which finishes the proof.

5.2.2 Non-differentiability of Brownian motion

Corollary 5.2.6 (Non-differentiability of Brownian motion). For t ∈ [0, ∞) fixed, P-a.s., Brownian motion is not differentiable at t. More precisely, we even have that P-a.s.,

lim sup_{h↓0} (B_{t+h} − B_t)/h = − lim inf_{h↓0} (B_{t+h} − B_t)/h = ∞.

Remark 5.2.7. The fact that P-a.s. sample paths of Brownian motion are continuous (even α-Hölder continuous for α ∈ (0, 1/2), cf. Theorem 3.2.20) but nowhere differentiable is quite interesting. Indeed, if you want to construct examples of such functions explicitly (see e.g. the 'Weierstraß function'), this is non-trivial.

Proof of Corollary 5.2.6. In combination with Proposition 5.2.3 we infer

lim sup_{h↓0} (B_h − B_0)/h = lim sup_{h↓0} (h B_{1/h})/h = lim sup_{s→∞} B_s = ∞,

where the last equality is due to Theorem 5.2.5, and similarly for the limes inferior, which entails the statement for t = 0. For general t we observe that the process (B_{s+t} − B_t), s ∈ [0, ∞), is again a standard Brownian motion (cf. Exercise 5.1.1), and this finishes the proof.

Theorem 5.2.8 (Blumenthal's 0-1 law). Let (B_t) be Brownian motion. Write

F_0^+ := ∩_{t>0} F_t,

where (F_t) denotes the canonical filtration of (B_t); F_0^+ is oftentimes referred to as the germ σ-algebra. Then F_0^+ is P-trivial.


In order to prove Theorem 5.2.8, let us first generalise the germ σ-algebra slightly and set

F_s^+ := ∩_{t>s} F_t.

Clearly, due to the growing nature of filtrations, we have that F_s ⊂ F_s^+. Since F_s^+ might be bigger, it is not immediately clear anymore that the process (B_{t+s} − B_s)_{t≥0} is a Brownian motion independent of the σ-algebra F_s^+ (due to the independence of increments, this statement is trivially true for the σ-algebra F_s). It turns out, however, that we retain independence also for the slightly bigger σ-algebra F_s^+.

Theorem 5.2.9. For every s ≥ 0, the process (B_{t+s} − B_s)_{t≥0} is a standard Brownian motion independent of F_s^+.

Proof. Suppose s_n > s are such that s_n ↓ s. By the Markov property of Brownian motion, the process (B_{t+s_n} − B_{s_n})_{t≥0} is a standard Brownian motion independent of F_{s_n}. By continuity of Brownian motion, for all t_j > t_{j−1} > · · · > t_1 > 0, we have

(B_{t_j+s} − B_{t_{j−1}+s}, . . . , B_{t_2+s} − B_{t_1+s}) = lim_{n→∞} (B_{t_j+s_n} − B_{t_{j−1}+s_n}, . . . , B_{t_2+s_n} − B_{t_1+s_n}),

so we infer that (B_{t+s} − B_s)_{t≥0} is independent of ∩_n F_{s_n} = F_s^+.

Proof of Theorem 5.2.8. For s = 0, we have that any A ∈ σ(B_t : t ≥ 0) is independent of F_0^+ by the preceding theorem. This applies in particular to A ∈ F_0^+, which makes such an A independent of itself; it must therefore have probability 0 or 1.

Remark 5.2.10. We might ask ourselves whether the σ-algebras F_s^+ contribute anything to the understanding of Brownian motion beyond their applications to P-trivial events. An affirmative answer follows immediately if we consider, for example, the random variable τ_A = inf{t ≥ 0 : B_t ∈ A} for an open set A ⊂ R (for simplicity, let B_0 ∉ A). Then it can easily be verified that τ_A is a stopping time with respect to the filtration (F_s^+)_{s≥0}, but not with respect to (F_s)_{s≥0}. The intuitive reason for this is that we can only learn whether B has entered an open set by looking infinitesimally into the future, whereas this foresight is not needed for closed sets (for which we already know that τ_A is a stopping time also with respect to (F_s)_{s≥0}).

5.2.3 The strong Markov property of Brownian motion

We already know that Brownian motion satisfies the Markov property, which can be interpreted as Brownian motion "restarting" itself at each deterministic time. Crucially, for Brownian motion this is true also at stopping times.

Theorem 5.2.11 (Strong Markov property). For every almost surely finite stopping time τ, the process (B_{τ+t} − B_τ)_{t≥0} is a standard Brownian motion independent of F_τ^+.

Proof. We first show the statement for the discrete approximations τ_n of τ, i.e. for τ_n = (m + 1) 2^{−n} if m 2^{−n} ≤ τ < (m + 1) 2^{−n}. Write B^k = (B^k_t)_{t≥0} for the Brownian motion defined by B^k_t = B_{t+k/2^n} − B_{k/2^n}, and B^* = (B^*_t)_{t≥0} for the process defined by B^*_t = B_{t+τ_n} − B_{τ_n}. Let E ∈ F^+_{τ_n}. It then holds for every measurable set of paths A that

P({B^* ∈ A} ∩ E) = Σ_{k=0}^{∞} P({B^k ∈ A} ∩ E ∩ {τ_n = k 2^{−n}})
= Σ_{k=0}^{∞} P(B^k ∈ A) P(E ∩ {τ_n = k 2^{−n}}),


where we used the independence of {B^k ∈ A} from E ∩ {τ_n = k 2^{−n}} ∈ F^+_{k2^{−n}}, which we have due to the preceding theorem. Using the (normal) Markov property, we have that P(B^k ∈ A) = P(B ∈ A), which therefore does not depend on k. This gives

Σ_{k=0}^{∞} P(B^k ∈ A) P(E ∩ {τ_n = k 2^{−n}}) = P(B ∈ A) Σ_{k=0}^{∞} P(E ∩ {τ_n = k 2^{−n}}) = P(B ∈ A) P(E),

which shows that B^* is a Brownian motion and independent of E, hence of F^+_{τ_n}.

We now generalise this statement to a general stopping time τ. As τ_n ↓ τ, we have that

(B_{t+τ_n} − B_{τ_n})_{t≥0}

is a Brownian motion independent of F^+_{τ_n} ⊃ F^+_τ. Therefore, the increments

B_{s+t+τ} − B_{t+τ} = lim_{n→∞} (B_{s+t+τ_n} − B_{t+τ_n})

of the process (B_{r+τ} − B_τ)_{r≥0} are independent and normally distributed with mean zero and variance s. Since the process is trivially almost surely continuous, it is a Brownian motion. Furthermore, all increments B_{s+t+τ} − B_{t+τ} = lim_{n→∞} (B_{s+t+τ_n} − B_{t+τ_n}) are independent of F^+_τ, and therefore so is the process itself.

Theorem 5.2.12 (Reflection principle for Brownian motion (essentially D. André, 1887)). Let (B_t) be standard Brownian motion. Then for t ∈ (0, ∞) and M > 0 we have

P( sup_{s∈[0,t]} B_s ≥ M ) = 2 P(B_t ≥ M).

In order to prove Theorem 5.2.12, we will first show that taking a Brownian motion and reflecting it upon hitting some value results in a process that is itself a Brownian motion. More precisely, we have the following result.

Theorem 5.2.13. If τ is a stopping time and (B_t)_{t≥0} is a standard Brownian motion, then the process (B^*_t)_{t≥0}, called the Brownian motion reflected at τ and defined by

B^*_t = B_t 1_{t≤τ} + (2B_τ − B_t) 1_{t>τ},

is also a standard Brownian motion.

Proof. On the event that τ is finite, by the strong Markov property both paths

(B_{t+τ} − B_τ)_{t≥0} and −(B_{t+τ} − B_τ)_{t≥0}

are Brownian motions independent of the beginning (B_t)_{0≤t≤τ}. The concatenation mapping, which takes a continuous path (g_t)_{t≥0} and glues it to the end point of a finite continuous path (f_t)_{0≤t≤τ} to form a new continuous path, is measurable. Therefore, the process arising from gluing the first path of the pair above to (B_t)_{0≤t≤τ} and the process arising from gluing the second path of the pair to (B_t)_{0≤t≤τ} have the same distribution. Clearly the first case results in the process (B_t)_{t≥0} and the second in (B^*_t)_{t≥0}. On the event that τ is infinite, the two processes agree by definition.

Proof of Theorem 5.2.12. Let τ = inf{t ≥ 0 : B_t = M} and let B^* be the Brownian motion reflected at the stopping time τ. Then

{ sup_{s∈[0,t]} B_s ≥ M } = { B_t ≥ M } ∪ { sup_{s∈[0,t]} B_s ≥ M, B_t < M }.

This is by choice a disjoint union, and the second event coincides with the event {B^*_t > M}. By the preceding theorem we obtain the desired result.
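The reflection principle lends itself to a quick Monte Carlo illustration (a sketch with arbitrary parameters; note that a discrete-time approximation slightly underestimates the running maximum, so the agreement is only approximate):

    import numpy as np

    rng = np.random.default_rng(6)

    n_steps, n_paths, M = 1_000, 20_000, 1.0
    steps = rng.standard_normal((n_paths, n_steps)) / np.sqrt(n_steps)
    paths = np.cumsum(steps, axis=1)               # approximate BM on [0, 1]

    lhs = np.mean(paths.max(axis=1) >= M)          # P(sup_{s<=1} B_s >= M)
    rhs = 2 * np.mean(paths[:, -1] >= M)           # 2 P(B_1 >= M) ~ 0.317
    print(lhs, rhs)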


5.2.4 The zero set of Brownian motion

What we will show in this section is the perhaps surprising property of standard Brownian motion that the set {t ≥ 0 : B_t = 0} is closed and contains no isolated points (such sets are sometimes called perfect sets). This is especially surprising, as Brownian motion almost surely has zeros that are isolated from the left (e.g. the first zero after time 1/2) or from the right (e.g. the last zero before 1/2). As a direct consequence we also get that the set of such zeros is uncountable, which one would not expect for a continuous function that is P-a.s. not monotone on any interval (as per Proposition 5.2.4).

Theorem 5.2.14. Let (B_t)_{t≥0} be a one-dimensional standard Brownian motion and

A = {t ≥ 0 : B_t = 0}

its zero set. Then A is almost surely a closed set with no isolated points.

Proof. Since Brownian motion is almost surely continuous, we have that with probability one, A is closed. To prove the second property, take for each q ∈ Q ∩ [0, ∞) the first zero after q, i.e. τ_q = inf{t ≥ q : B_t = 0}, and note that τ_q is an almost surely finite stopping time. Since A is almost surely closed, the infimum is achieved and is therefore almost surely a minimum. By the strong Markov property applied at τ_q, we have that for each q, almost surely, τ_q is not a zero isolated from the right. Therefore, almost surely, for all of the countably many rationals q, τ_q is not isolated from the right. For those 0 < t ∈ A that are different from τ_q for all q ∈ Q, we now show that they are not isolated from the left. To see this, take Q ∋ q_n ↑ t and consider t_n := τ_{q_n}. It holds that q_n ≤ t_n < t, and therefore t_n ↑ t, so t is not isolated from the left.

Corollary 5.2.15. Let A be the zero set of standard Brownian motion. Then A is uncountable.

Proof. The uncountability of A follows from the known result that perfect sets are uncountable, which we now prove. We will do this by constructing an injection into A from {1, 2}^N, which is known to be uncountable. Start by choosing a point x_1 ∈ A. As this point is not isolated, there exists a second point x_2 ∈ A. Take now two disjoint balls B_1 and B_2 around these two points. As x_1 is not isolated, we can find two points in B_1 ∩ A around which we can place two disjoint balls contained in B_1 with no more than half its radius; we do similarly for B_2 ∩ A, and so on. There now exists a bijection between {1, 2}^N and the set of decreasing sequences of balls so constructed. As A is complete, the intersection of the balls in each such sequence contains at least one point of A, and two points belonging to two different sequences are different. This proves our claim.

5.3 Local central limit theorem

The presentation of this section is strongly guided by Section 2 of [LL10]. The local central limit theorem is important in many contexts in probability and arises quite naturally from a finer investigation of what is happening at microscopic scales in the standard central limit theorem (see [Dre19, Theorem 3.8.1]) applied to certain random walks. Let $(X_n)$ be an i.i.d. sequence of $\mathbb{Z}^d$-valued random variables and denote by $S_n := \sum_{k=1}^n X_k$ the corresponding random walk. For simplicity denote by $p_n$ the distribution $\mathbb{P}(S_n \in \cdot)$ of $S_n$ on $(\mathbb{Z}^d, \mathcal{B}(\mathbb{Z}^d))$. Furthermore, denote by
$$\overline{p}_n(x) := \frac{1}{(2\pi n)^{\frac{d}{2}} \sqrt{\det \Sigma}}\, e^{-\frac{(x, \Sigma^{-1} x)}{2n}}$$
the probability density of a $\mathcal{N}(0, n\Sigma)$-distributed $\mathbb{R}^d$-valued normal random variable. As you know from the basic probability theory lectures, this is in fact the distribution of the sum of $n$ independent $\mathcal{N}(0,\Sigma)$-distributed random variables.
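For later numerical comparisons, here is a direct transcription of $\overline{p}_n$ into code (our addition, not part of the notes; `Sigma` is assumed to be a positive definite matrix, and the function name is ours).

```python
# Gaussian density \bar{p}_n(x) of N(0, n*Sigma) on R^d (illustration only).
import numpy as np

def p_bar(x, n, Sigma):
    d = len(x)
    quad = x @ np.linalg.inv(Sigma) @ x          # (x, Sigma^{-1} x)
    norm = (2 * np.pi * n) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / (2 * n)) / norm

print(p_bar(np.zeros(2), 10, np.eye(2)))         # peak value for d = 2, n = 10
```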


Also, recall that the gist of the central limit theorem was that asymptotically, for partial sums $S_n$ of suitable random variables (keeping notation one-dimensional for simplicity),
$$\frac{S_n - \mathbb{E}[S_n]}{\sqrt{\operatorname{Var}(S_n)}}$$
is well approximated by a $\mathcal{N}(0,1)$-variable. While this is in some sense a 'macroscopic' result (since space is rescaled by $\sqrt{\operatorname{Var}(S_n)}$, which is of order $\sqrt{n}$), the local central limit theorem gives insight into the 'microscopic' behaviour, in the sense that $S_n - \mathbb{E}[S_n]$ is well approximated by a $\mathcal{N}(0, \operatorname{Var}(S_n))$-distributed random variable (heuristically, this is a multiplication of the scale in the central limit theorem by $\sqrt{\operatorname{Var}(S_n)}$).

Theorem 5.3.1 (Local central limit theorem (LCLT), Theorem 2.3.5 in [LL10]). Let $S_n := \sum_{k=1}^n X_k$ be a random walk such that $\mathbb{E}[X_1] = 0 \in \mathbb{R}^d$ and $\operatorname{Var}(\pi_i(X_1)) \in (0,\infty)$ for all $i \in \{1, \dots, d\}$ (where $\pi_i$ denotes the projection on the $i$-th coordinate), and such that
$$\mathbb{E}[|X_1|^3] < \infty. \qquad (5.3.1)$$
Furthermore, assume that $(S_n)$ is an aperiodic and irreducible random walk on $\mathbb{Z}^d$, i.e., for each $x \in \mathbb{Z}^d$ there exists $N_x \in \mathbb{N}$ such that $p_n(x) > 0$ for all $n \ge N_x$.

Then, with the above notation, there exists a constant $C \in (0,\infty)$ such that for all $n \in \mathbb{N}$ and $x \in \mathbb{Z}^d$,
$$|p_n(x) - \overline{p}_n(x)| \le \frac{C}{n^{(d+1)/2}}.$$
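Before turning to the proof, the bound can be checked numerically. The sketch below (our illustration, not from [LL10]) uses the lazy simple random walk on $\mathbb{Z}$, with $\mathbb{P}(X_1 = \pm 1) = 1/4$ and $\mathbb{P}(X_1 = 0) = 1/2$, which is aperiodic and irreducible with $\Sigma = 1/2$; for $d = 1$ the theorem asserts that $n \cdot \max_x |p_n(x) - \overline{p}_n(x)|$ stays bounded (for this symmetric walk it in fact tends to zero, since the third moment vanishes).

```python
# Exact p_n for the lazy walk on Z via convolution, compared with \bar{p}_n
# (illustration only; d = 1, Sigma = Var(X_1) = 1/2).
import numpy as np

step = np.array([0.25, 0.5, 0.25])   # one-step law on {-1, 0, 1}
sigma2 = 0.5

for n in [10, 100, 1000]:
    p = np.array([1.0])              # law of S_0
    for _ in range(n):
        p = np.convolve(p, step)     # law of S_n, supported on {-n, ..., n}
    x = np.arange(-n, n + 1)
    p_bar = np.exp(-x**2 / (2 * n * sigma2)) / np.sqrt(2 * np.pi * n * sigma2)
    print(f"n={n:4d}:  n * max|p_n - p_bar| = {n * np.max(np.abs(p - p_bar)):.4f}")
```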

As in the case of the standard central limit theorem (as well as the one for martingales), the proof will be based on computations of characteristic functions. Indeed, we have the following auxiliary result.

Lemma 5.3.2 (Lemma 2.3.3 of [LL10]). Let $X := X_1$ with $X_1$ as in Theorem 5.3.1. Using the Taylor expansion of the exponential function around $0$ in combination with (5.3.1), we write the characteristic function as
$$\varphi_X(t) = 1 - \frac{(t, \Sigma t)}{2} + g(t), \qquad t \in \mathbb{R}^d, \qquad (5.3.2)$$
for some function $g : \mathbb{R}^d \to \mathbb{C}$ with
$$g(t) \in O(|t|^3) \quad \text{as } t \to 0. \qquad (5.3.3)$$
Then there exist $\varepsilon, C \in (0,\infty)$ such that for all $n \in \mathbb{N}$ and $|t| \le \varepsilon \sqrt{n}$, we have
$$\Big(\varphi_X\Big(\frac{t}{\sqrt{n}}\Big)\Big)^n = \exp\Big\{-\frac{(t, \Sigma t)}{2} + h(t,n)\Big\},$$
where
$$|h(t,n)| \le \min\Big\{\frac{(t, \Sigma t)}{4},\ n \Big|g\Big(\frac{t}{\sqrt{n}}\Big)\Big| + \frac{C |t|^4}{n}\Big\}. \qquad (5.3.4)$$

Proof. From (5.3.2) we infer, by taking the logarithm and using its second order Taylor expansion around $1$, that
$$\ln(\varphi_X(t)) = -\frac{(t, \Sigma t)}{2} + g(t) - \frac{(t, \Sigma t)^2}{8} + O(|g(t)|\, |t|^2) + O(|t|^6).$$


Hence, multiplying this display by $n$ and at the same time replacing $t$ by $t/\sqrt{n}$, we infer that
$$n \ln\Big(\varphi_X\Big(\frac{t}{\sqrt{n}}\Big)\Big) = -\frac{(t, \Sigma t)}{2} + \underbrace{n\, g\Big(\frac{t}{\sqrt{n}}\Big) - \frac{(t, \Sigma t)^2}{8n} + O\Big(\Big|g\Big(\frac{t}{\sqrt{n}}\Big)\Big| \cdot |t|^2\Big) + n^{-2} O(|t|^6)}_{=:\, h(t,n)}.$$
From this definition of $h$, in combination with (5.3.3), it immediately follows that we can find $\varepsilon > 0$ such that for all $t$ with $|t| \le \varepsilon \sqrt{n}$, the quantity $|h(t,n)|$ is bounded from above by the second term in the minimum on the right-hand side of (5.3.4). Choosing $\varepsilon$ even smaller, we also get that $|h(t,n)|$ is bounded by the first term, $(t, \Sigma t)/4$, by using that $g(t) \in o(|t|^2)$.
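A quick numerical sanity check of this expansion (our addition, not part of the notes): for the lazy walk on $\mathbb{Z}$ from the earlier illustration, $\varphi_X(t) = \tfrac{1}{2} + \tfrac{1}{2}\cos t$, and $(\varphi_X(t/\sqrt{n}))^n$ should approach $e^{-(t, \Sigma t)/2} = e^{-t^2/4}$.

```python
# (phi_X(t/sqrt(n)))^n -> exp(-t^2/4) for the lazy walk (illustration only).
import numpy as np

def phi(t):
    return 0.5 + 0.5 * np.cos(t)     # characteristic function of one lazy step

t = np.linspace(-3.0, 3.0, 7)
for n in [10, 100, 10_000]:
    deviation = np.max(np.abs(phi(t / np.sqrt(n)) ** n - np.exp(-t**2 / 4)))
    print(f"n={n:6d}: max deviation {deviation:.2e}")
```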

Lemma 5.3.3 (Lemma 2.3.4 of [LL10]). Assume the setting and notation from Lemma 5.3.2. Then there exist $C', \gamma \in (0,\infty)$ such that for all $r \in [0, \varepsilon\sqrt{n}]$, defining $\delta_n(x,r)$ via
$$p_n(x) = \overline{p}_n(x) + \delta_n(x,r) + \frac{1}{(2\pi)^d n^{d/2}} \int_{|t| \le r} \exp\Big\{-\frac{i x \cdot t}{\sqrt{n}} - \frac{(t, \Sigma t)}{2}\Big\} \big(e^{h(t,n)} - 1\big)\, \lambda^d(\mathrm{d}t),$$
we have
$$|\delta_n(x,r)| \le C' n^{-\frac{d}{2}} e^{-\gamma r^2}.$$

For the proof of Lemma 5.3.3 we recall the Fourier inversion formula, Corollary 3.8.6 of [Dre19]. While a standing assumption in that result was that the measure to which it was applied had a density with respect to $\lambda^d$, the result can be extended to cover the case of discrete random variables, and we can therefore infer, using that the characteristic function 'turns sums of independent random variables into products', that we have
$$p_n(x) = \frac{1}{(2\pi)^d} \int_{[-\pi,\pi)^d} \varphi^n_{X_1}(t)\, e^{-i t \cdot x}\, \lambda^d(\mathrm{d}t) \qquad (5.3.5)$$
(see e.g. display (2.12) in [LL10]). Also, we remark that for each $\varepsilon > 0$,
$$\sup\big\{|\varphi(t)| : t \in [-\pi,\pi]^d,\ |t| \ge \varepsilon\big\} < 1, \qquad (5.3.6)$$
as well as the fact that
$$\frac{1}{(2\pi)^d n^{\frac{d}{2}}} \int_{\mathbb{R}^d} e^{-i t \cdot x / \sqrt{n}}\, e^{-\frac{(t, \Sigma t)}{2}}\, \lambda^d(\mathrm{d}t) = \overline{p}_n(x), \qquad (5.3.7)$$
which follows from an explicit calculation (see e.g. display (2.2) in [LL10]). Scrutinising (5.3.7), we can upper bound part of the integral via
$$\Big|\frac{1}{(2\pi)^d n^{\frac{d}{2}}} \int_{|t| \ge \varepsilon\sqrt{n}} e^{-i t \cdot x / \sqrt{n}}\, e^{-\frac{(t, \Sigma t)}{2}}\, \lambda^d(\mathrm{d}t)\Big| \le \frac{1}{(2\pi)^d n^{\frac{d}{2}}} \int_{|t| \ge \varepsilon\sqrt{n}} e^{-\frac{(t, \Sigma t)}{2}}\, \lambda^d(\mathrm{d}t) = O(e^{-\varepsilon n}), \qquad (5.3.8)$$

which turns out to be negligible as n→∞.
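Formula (5.3.5) is easy to verify numerically. The following sketch (our illustration; the lazy walk and all names are our own choices) compares the inversion integral with the exact value of $p_n(x)$ obtained by convolution.

```python
# Numerical check of the inversion formula (5.3.5) for the lazy walk on Z:
# p_n(x) = (2 pi)^{-1} \int_{[-pi,pi)} phi_X(t)^n e^{-i t x} dt  (d = 1).
import numpy as np

def phi(t):
    return 0.5 + 0.5 * np.cos(t)           # characteristic function of one step

n, x = 6, 2
# Exact p_n(x) by convolving the one-step law n times.
p = np.array([1.0])
for _ in range(n):
    p = np.convolve(p, np.array([0.25, 0.5, 0.25]))
exact = p[x + n]                            # support of S_n is {-n, ..., n}

# Riemann-sum approximation of the inversion integral over [-pi, pi).
t = -np.pi + 2 * np.pi * np.arange(4096) / 4096
inverted = np.mean(phi(t) ** n * np.exp(-1j * t * x)).real
print(f"convolution: {exact:.10f},  Fourier inversion: {inverted:.10f}")
```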

Proof of Lemma 5.3.3. In order to employ the asymptotic behaviour established in (5.3.2), we use (5.3.5) and the transformation formula to infer that
$$p_n(x) = \frac{1}{(2\pi)^d} \int_{[-\pi,\pi)^d} \varphi^n_X(t)\, e^{-i t \cdot x}\, \lambda^d(\mathrm{d}t) = \frac{1}{(2\pi)^d n^{\frac{d}{2}}} \int_{[-\sqrt{n}\pi, \sqrt{n}\pi)^d} \varphi^n_X(t/\sqrt{n})\, e^{-i t \cdot x / \sqrt{n}}\, \lambda^d(\mathrm{d}t).$$


In combination with (5.3.6) we find that for $\varepsilon > 0$ there exists $c > 0$ such that $|\varphi_X(t)| \le e^{-c}$ for all $t \in [-\pi,\pi)^d$ with $|t| \ge \varepsilon$. Hence, the previous display can be continued as
$$= \frac{1}{(2\pi)^d n^{\frac{d}{2}}} \int_{|t| \le \varepsilon\sqrt{n}} \varphi^n_X(t/\sqrt{n})\, e^{-i t \cdot x / \sqrt{n}}\, \lambda^d(\mathrm{d}t) + O(e^{-cn}). \qquad (5.3.9)$$

Using Lemma 5.3.2 we have
$$\varphi^n_X(t/\sqrt{n}) = e^{-\frac{(t, \Sigma t)}{2}} + e^{-\frac{(t, \Sigma t)}{2}} \big(e^{h(t,n)} - 1\big).$$
As a consequence, plugging this into (5.3.9) and combining it with (5.3.8) and (5.3.7), we infer that
$$p_n(x) = \overline{p}_n(x) + O(e^{-\varepsilon n}) + \frac{1}{(2\pi)^d n^{d/2}} \int_{|t| \le \varepsilon\sqrt{n}} \exp\Big\{-\frac{i x \cdot t}{\sqrt{n}} - \frac{(t, \Sigma t)}{2}\Big\} \big(e^{h(t,n)} - 1\big)\, \lambda^d(\mathrm{d}t),$$
which proves the result for $r = \varepsilon\sqrt{n}$. Otherwise, take advantage of
$$|e^{h(t,n)} - 1| \le e^{\frac{(t, \Sigma t)}{4}} + 1, \qquad (5.3.10)$$
which follows from (5.3.4). From this we then get
$$\Big|\int_{r \le |t| \le \varepsilon\sqrt{n}} \exp\Big\{-\frac{i x \cdot t}{\sqrt{n}} - \frac{(t, \Sigma t)}{2}\Big\} \big(e^{h(t,n)} - 1\big)\, \lambda^d(\mathrm{d}t)\Big| \le 2 \int_{|t| \ge r} e^{-\frac{(t, \Sigma t)}{4}}\, \lambda^d(\mathrm{d}t) = O(e^{-\gamma r^2}),$$
which finishes the proof.

Proof of Theorem 5.3.1. Choosing $r = n^{\frac{1}{8}}$ in Lemma 5.3.3 we deduce
$$p_n(x) = \overline{p}_n(x) + O(e^{-\gamma n^{\frac{1}{4}}}) + \frac{1}{(2\pi)^d n^{d/2}} \int_{|t| \le n^{1/8}} \exp\Big\{-\frac{i x \cdot t}{\sqrt{n}} - \frac{(t, \Sigma t)}{2}\Big\} \big(e^{h(t,n)} - 1\big)\, \lambda^d(\mathrm{d}t),$$
where for $|t| \le n^{\frac{1}{8}}$ we have with (5.3.3) and (5.3.4) that
$$|h(t,n)| \le C |t|^3 n^{-\frac{1}{2}},$$
and thus $|e^{h(t,n)} - 1| \le C |t|^3 / \sqrt{n}$, which in turn implies that
$$\Big|\int_{|t| \le n^{1/8}} \exp\Big\{-\frac{i x \cdot t}{\sqrt{n}} - \frac{(t, \Sigma t)}{2}\Big\} \big(e^{h(t,n)} - 1\big)\, \lambda^d(\mathrm{d}t)\Big| \le \frac{C}{\sqrt{n}} \int_{\mathbb{R}^d} |t|^3 e^{-\frac{(t, \Sigma t)}{2}}\, \lambda^d(\mathrm{d}t) \le \frac{C}{\sqrt{n}}.$$
Together with the prefactor $\frac{1}{(2\pi)^d n^{d/2}}$ this yields the claimed bound of order $n^{-(d+1)/2}$ and finishes the proof.

A more profound treatment of this topic, in particular the important coupling of random walks and Brownian motion that it makes possible, can be found in [LL10, Chapter 7].

Acknowledgement: I would like to thank the students of this class, as well as Alexei Prevorst and Lars Schmitz, for several useful comments and for pointing out a variety of typos. Finally, many thanks to Prof. Alexander Drewitz for kindly letting me base this lecture on his lecture notes.


Bibliography

[Bau92] Heinz Bauer. Maß- und Integrationstheorie. de Gruyter Lehrbuch [de Gruyter Textbook]. Walter de Gruyter & Co., Berlin, second edition, 1992.

[Ben38] Frank Benford. The law of anomalous numbers. Proceedings of the American Philosophical Society, 78(4):551–572, 1938.

[BH+11] Arno Berger, Theodore P. Hill, et al. A basic theory of Benford's Law. Probability Surveys, 8(1):1–126, 2011.

[Bol68] L. Boltzmann. Lösung eines mechanischen Problems. Wiener Berichte, 58:1035–1044, 1868.

[DGRS12] Alexander Drewitz, Jürgen Gärtner, Alejandro F. Ramírez, and Rongfeng Sun. Survival probability of a random walk among a Poisson system of moving traps. In Probability in complex physical systems, volume 11 of Springer Proc. Math., pages 119–158. Springer, Heidelberg, 2012.

[Dre18] Alexander Drewitz. Introduction to probability and statistics (lecture notes), 2018.

[Dre19] Alexander Drewitz. Probability theory I (lecture notes), 2019.

[Dud02] R. M. Dudley. Real analysis and probability, volume 74 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2002. Revised reprint of the 1989 original.

[Dur10] Rick Durrett. Probability: theory and examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition, 2010.

[Fad85] Arnold M. Faden. The existence of regular conditional probabilities: necessary and sufficient conditions. Ann. Probab., 13(1):288–298, 1985.

[K16] Wolfgang König. The parabolic Anderson model. Pathways in Mathematics. Birkhäuser/Springer, [Cham], 2016. Random walk in random potential.

[Kal02] Olav Kallenberg. Foundations of modern probability. Probability and its Applications (New York). Springer-Verlag, New York, second edition, 2002.

[Kin73] J. F. C. Kingman. Subadditive ergodic theory. Ann. Probab., 1:883–909, 1973. With discussion by D. L. Burkholder, Daryl Daley, H. Kesten, P. Ney, Frank Spitzer and J. M. Hammersley, and a reply by the author.

[Kle14] Achim Klenke. Probability theory. Universitext. Springer, London, second edition, 2014. A comprehensive course.

[KS91] Ioannis Karatzas and Steven E. Shreve. Brownian motion and stochastic calculus, volume 113 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1991.


[L92] Paul Lévy. Processus stochastiques et mouvement brownien. Les Grands Classiques Gauthier-Villars [Gauthier-Villars Great Classics]. Éditions Jacques Gabay, Sceaux, 1992. Followed by a note by M. Loève; reprint of the second (1965) edition.

[Lig85] Thomas M. Liggett. An improved subadditive ergodic theorem. Ann. Probab., 13(4):1279–1285, 1985.

[LL10] Gregory F. Lawler and Vlada Limic. Random walk: a modern introduction, volume 123 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2010.

[LPW09] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical Society, Providence, RI, 2009. With a chapter by James G. Propp and David B. Wilson.

[MP10] Peter Mörters and Yuval Peres. Brownian motion. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2010. With an appendix by Oded Schramm and Wendelin Werner.

[Str11] Daniel W. Stroock. Probability theory. Cambridge University Press, Cambridge, second edition, 2011. An analytic view.

[SV79] Daniel W. Stroock and S. R. Srinivasa Varadhan. Multidimensional diffusion processes, volume 233 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1979.

[Var01] S. R. S. Varadhan. Probability theory, volume 7 of Courant Lecture Notes in Mathematics. New York University, Courant Institute of Mathematical Sciences, New York; American Mathematical Society, Providence, RI, 2001.

[Wil91] David Williams. Probability with martingales. Cambridge Mathematical Textbooks. Cambridge University Press, Cambridge, 1991.


Appendix A

Alternative constructions and proofs

A.1 Conditional expectation: a different approach

Theorem A.1.1. Let $X \in L^2(\Omega, \mathcal{F}, \mathbb{P})$. Then, for any sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$ there exists a $\mathcal{G}$-$\mathcal{B}(\mathbb{R})$-measurable $Y$ with $\mathbb{E}[Y^2] < \infty$ satisfying $\mathbb{E}[\mathbb{1}_G Y] = \mathbb{E}[\mathbb{1}_G X]$ for all $G \in \mathcal{G}$.

Proof. Consider the norm $\|Y\|_2 := \mathbb{E}[Y^2]^{1/2}$ on the vector space $V = L^2(\Omega, \mathcal{F}, \mathbb{P})$ of real-valued random variables $Y$ with finite second moment, with inner product given by
$$\langle Y_1, Y_2 \rangle = \mathbb{E}[Y_1 Y_2].$$
As $L^p$ spaces are complete, this makes $V$ into a Hilbert space. Then $U = L^2(\Omega, \mathcal{G}, \mathbb{P})$ is a complete and therefore closed subspace of $V$. By the existence of orthogonal projections onto closed subspaces of Hilbert spaces, there exists an orthogonal projection $\pi : V \to U$. In particular, $\langle \pi Y - Y, Z \rangle = 0$ for all $Y \in V$ and $Z \in U$. Setting $Y = \pi X$ gives
$$\mathbb{E}[\mathbb{1}_G Y] - \mathbb{E}[\mathbb{1}_G X] = \langle \mathbb{1}_G, \pi X - X \rangle = 0$$
as required.
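On a finite probability space the projection $\pi$ can be written down explicitly, which makes the statement very concrete. The following sketch (our addition, not part of the notes; the space, partition, and values of $X$ are arbitrary choices) computes $Y = \pi X$ by averaging $X$ over the blocks of a partition generating $\mathcal{G}$ and verifies the defining identity.

```python
# Conditional expectation as an L^2 projection on a finite space
# (illustration only): Omega = {0,...,5}, uniform P, G generated by the
# partition {{0,1,2},{3,4,5}}.
import numpy as np

P = np.full(6, 1 / 6)
X = np.array([1.0, 4.0, 7.0, 2.0, 0.0, 3.0])
blocks = [np.array([0, 1, 2]), np.array([3, 4, 5])]

Y = np.empty_like(X)                 # Y = pi X: average of X on each block
for b in blocks:
    Y[b] = np.sum(P[b] * X[b]) / np.sum(P[b])

# E[1_G Y] = E[1_G X] on the generating events, and pi X - X is orthogonal
# to every G-measurable Z (i.e. every Z constant on the blocks).
for b in blocks:
    print(np.sum(P[b] * Y[b]), np.sum(P[b] * X[b]))
Z = np.array([2.0, 2.0, 2.0, -1.0, -1.0, -1.0])
print(np.sum(P * (Y - X) * Z))       # ~ 0 up to rounding
```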

Corollary A.1.2. Let $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$. Then, for any sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$ there exists a $\mathcal{G}$-$\mathcal{B}(\mathbb{R})$-measurable $Y$ with $\mathbb{E}[|Y|] < \infty$ satisfying $\mathbb{E}[\mathbb{1}_G Y] = \mathbb{E}[\mathbb{1}_G X]$ for all $G \in \mathcal{G}$.

Proof. Let $X = X^+ - X^-$, where $X^\pm$ are non-negative, integrable random variables. Define $X^\pm_n = \min\{X^\pm, n\}$. Since $X^\pm_n$ is bounded, we have by Theorem A.1.1 that $Y^\pm_n = \mathbb{E}[X^\pm_n \mid \mathcal{G}]$ exists and is unique a.s. Using the same steps as in the proof of Theorem 1.0.5 for non-integrable $X$, we obtain the existence of $Y^\pm$ and therefore also of $Y = Y^+ - Y^-$.

A.2 Martingales

Lemma A.2.1. A positive random variable $\tau$ is integrable if there exist $N \in \mathbb{N}$ and $\varepsilon > 0$ such that
$$\mathbb{P}(\tau \le n + N \mid \tau > n) \ge \varepsilon \qquad \text{for all } n \in \mathbb{N}_0.$$

Proof. We calculate
$$\mathbb{E}[\tau] = \int_0^\infty \mathbb{P}(\tau > t)\, \mathrm{d}t = \sum_{\ell \in \mathbb{N}_0} \int_{\ell N}^{(\ell+1)N} \mathbb{P}(\tau > t)\, \mathrm{d}t \le \sum_{\ell \in \mathbb{N}_0} \mathbb{P}(\tau > \ell N) \int_{\ell N}^{(\ell+1)N} \mathrm{d}t \le \sum_{\ell \in \mathbb{N}_0} (1-\varepsilon)^\ell N = \frac{N}{1 - (1-\varepsilon)} = \frac{N}{\varepsilon} < \infty,$$


which gives the desired result, once we argue the validity of the inequality
$$\mathbb{P}(\tau > \ell N) \le (1-\varepsilon)^\ell.$$
We prove this inequality by induction. For $\ell = 0$ the inequality trivially holds. For general $\ell$, note that $\{\tau > (\ell+1)N\} \subseteq \{\tau > \ell N\}$, so we can write
$$\mathbb{P}(\tau > (\ell+1)N) = \mathbb{P}(\tau > (\ell+1)N,\ \tau > \ell N) = \mathbb{P}(\tau > \ell N)\, \mathbb{P}(\tau > (\ell+1)N \mid \tau > \ell N).$$
The assumption of the lemma gives $\mathbb{P}(\tau > n + N \mid \tau > n) \le 1 - \varepsilon$. Choosing $n = \ell N$ yields $\mathbb{P}(\tau > (\ell+1)N \mid \tau > \ell N) \le 1 - \varepsilon$, which combined with the induction hypothesis $\mathbb{P}(\tau > \ell N) \le (1-\varepsilon)^\ell$ yields the desired inequality.
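The hypothesis and the resulting bound $\mathbb{E}[\tau] \le N/\varepsilon$ can be illustrated with a geometric stopping time (our sketch, not part of the notes): if at each unit of time the process stops with probability $p$, independently, then $\mathbb{P}(\tau \le n+1 \mid \tau > n) = p$, so the lemma applies with $N = 1$ and $\varepsilon = p$.

```python
# Monte Carlo check of E[tau] <= N/eps for a geometric stopping time
# (illustration only): here N = 1, eps = p, and the bound is in fact sharp.
import numpy as np

rng = np.random.default_rng(1)
p = 0.2
tau = rng.geometric(p, size=100_000)         # geometric law on {1, 2, ...}
print(f"empirical E[tau] = {tau.mean():.3f},  bound N/eps = {1 / p:.3f}")
```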
