
ST213 Mathematics of Random Events
Wilfrid S. Kendall
version 1.0, 28 April 1999

1. Introduction

The main purpose of the course ST213 Mathematics of Random Events (which we will abbreviate to MoRE) is to work over again the basics of the mathematics of uncertainty. You have already covered this in a rough-and-ready fashion in:

(a) ST111 Probability;

and possibly in

(b) ST114 Games and Decisions.

In this course we will cover these matters with more care. It is important to do this because a proper appreciation of the fundamentals of the mathematics of random events

(a) gives an essential basis for getting a good grip on the basic ideas of statistics;

(b) will be of increasing importance in the future as it forms the basis of the hugely important field of mathematical finance.

It is appropriate at this level that we cover the material emphasizing concepts rather than proofs: by-and-large we will concentrate on what the results say and so will on some occasions explain them rather than prove them. The third-year courses MA305 Measure Theory and ST318 Probability Theory go into the matter of proofs. For further discussion of how Warwick probability courses fit together, see our road-map to probability at Warwick at

    www.warwick.ac.uk/statsdept/teaching/probmap.html

    1.1 Books

    [1] D. Williams (1991) Probability with Martingales CUP.

1.2 Resources (including examination information)

The course is composed of 30 lectures, valued at 12 CATS credits. It has an assessed component (20%) as well as an examination in the summer term. The assessed component will be conducted as follows: an exercise sheet will be handed out approximately every fortnight, totalling 4 sheets. In the 10 minutes at the start of the next lecture you produce an answer to one question under examination conditions, specified at the start of the lecture. Model answers will be distributed after the test, and an examples class will be held a week after the test. The tests will be marked, and the assessed component will be based on the best 3 out of 4 of your answers.

ST111: The first-year core course introducing probability ideas.

ST114: The first-year optional course explaining how probability ideas arise naturally when trying to think rationally about uncertainty and about strategies in games of chance.

    This method helps you learn during the lecture course so should:

    improve your exam marks; increase your enjoyment of the course; cost less time than end-of-term assessment.

Further copies of exercise sheets (after they have been handed out in lectures!) can be obtained at the homepage for the ST213 course:

www.warwick.ac.uk/statsdept/teaching/ST213.html

These notes will also be made available at the above URL, chapter by chapter as they are covered in lectures. Notice that they do not cover all the material of the lectures: their purpose is to provide a basic skeleton of summary material to supplement the notes you make during lectures. For example no proofs are included. In particular you will not find it possible to cover the course by ignoring lectures and depending on these notes alone! Further related material (eg: related courses, some pretty pictures of random processes, ...) can be obtained by following links from W.S. Kendall's homepage:

www.warwick.ac.uk/statsdept/Staff/WSK/

Finally, the Library Student Reserve Collection (SRC) will in the summer term hold copies of previous examination papers, and we will run two revision classes for this course at that time.

1.3 Motivating Examples

Here are some examples to help us see what the issues are.

(1) J. Bernoulli (circa 1692): Suppose that A_1, A_2, ... are mutually independent events, each of which has probability p. Define

S_n = #{ events A_k which happen for k ≤ n } .

Then the probability that S_n/n is close to p increases to 1 as n tends to infinity:

P[ |S_n/n − p| ≤ ε ] → 1  as n → ∞, for all ε > 0.

(A small simulation sketch of this appears just after these motivating examples.)

(2) Suppose the random variable U is uniformly distributed over the continuous range [0, 1]. Why is it that for all x in [0, 1] we have

P[ U = x ] = 0

and yet

P[ a ≤ U ≤ b ] = b − a

whenever 0 ≤ a ≤ b ≤ 1? Why can't we argue as follows?

P[ a ≤ U ≤ b ] = P[ ⋃_{x∈[a,b]} {x} ] = Σ_{x∈[a,b]} P[ U = x ] = 0 ?

(3) The Banach-Tarski paradox. Consider a sphere S². In a certain qualified sense it is possible to do the following curious thing: we can find a subset F ⊆ S² and (for any k ≥ 3) rotations ρ^k_1, ρ^k_2, ..., ρ^k_k such that

S² = ρ^k_1 F ∪ ρ^k_2 F ∪ ... ∪ ρ^k_k F

is a disjoint union. What then should we suppose the surface area of F to be? Since S² = ρ^3_1 F ∪ ρ^3_2 F ∪ ρ^3_3 F we can argue for area(F) = 1/3. But since S² = ρ^4_1 F ∪ ρ^4_2 F ∪ ρ^4_3 F ∪ ρ^4_4 F we can equally argue for area(F) = 1/4. Or similarly for area(F) = 1/5. Or 1/6, or ...

(4) Reverting to Bernoulli's example (Example 1 above) we could ask: what is the probability that, when we look at the whole sequence S_1/1, S_2/2, S_3/3, ..., we see the sequence tends to p? Is this different from Bernoulli's statement?

(5) Here is a question which is apparently quite different, which turns out to be strongly related to the above ideas! Can we generalize the idea of a Riemann integral in such a way as to make sense of rather discontinuous integrands, such as the case given below?

∫_0^1 f(x) dx

where f(x) = 1 when x is a rational number, and f(x) = 0 when x is an irrational number.
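A quick way to get a feel for Bernoulli's example (1) is to simulate it. Here is a minimal Python sketch (the probability p = 0.3, the tolerance eps = 0.05, the sample sizes and the helper name are all illustrative choices): it estimates P[ |S_n/n − p| ≤ ε ] for increasing n, and the estimates climb towards 1.

    # Simulation sketch for Example (1): P[ |S_n/n - p| <= eps ] -> 1 as n -> infinity.
    import random

    def estimate_closeness_probability(n, p=0.3, eps=0.05, trials=2000):
        """Estimate P[ |S_n/n - p| <= eps ], S_n = number of successes in n Bernoulli(p) trials."""
        close = 0
        for _ in range(trials):
            s_n = sum(1 for _ in range(n) if random.random() < p)
            if abs(s_n / n - p) <= eps:
                close += 1
        return close / trials

    for n in (10, 100, 1000, 10000):
        print(n, estimate_closeness_probability(n))
    # The printed estimates increase towards 1 as n grows.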

2. Probabilities, algebras, and σ-algebras

2.1 Motivation

Consider two coins A and B which are tossed in the air so as each to land with either heads or tails upwards. We do not assume the coin-tosses are independent!

    The "Riemann" integral uses the kind of integration you learnt at school, which works well for ordinary functions (polynomial, trigonometric, ...) but becomes awkward when applied to functions with discontinuities.

It is often the case that one feels justified in assuming the coins individually are equally likely to come up heads or tails. Using the fact P[A = T] = 1 − P[A = H], etc., we find

P[A comes up heads] = 1/2
P[B comes up heads] = 1/2

To find probabilities such as P[HH] = P[A = H, B = H] we need to say something about the relationship between the two coin-tosses. It is often the case that one feels justified in assuming the coin-tosses are independent, so

P[A = H, B = H] = P[A = H] × P[B = H] .

However this assumption may be unwise when the person tossing the coin is not experienced! We may decide that some variant of the following is a better model: the event determining [B = H] is C if [A = H], D if [A = T], where

P[C = H] = 3/4
P[D = H] = 1/4

and A, C, D are independent.

There are two stages of specification at work here. Given a collection C of events, and specified probabilities P[C] for each C ∈ C, we can find P[C^c] = 1 − P[C], the probability of the complement C^c of C, but not necessarily P[C ∩ D] for C, D ∈ C.
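As a concrete check of the two-stage specification, the following minimal Python sketch (an illustration; the enumeration approach and names are ours, while the probabilities 1/2, 3/4, 1/4 are those of the variant model above) computes P[A = H, B = H] exactly by enumerating the independent choices for A, C and D.

    # Enumerate the variant model: A, C, D independent,
    # P[A=H] = 1/2, P[C=H] = 3/4, P[D=H] = 1/4, and B = C if A = H, B = D if A = T.
    from fractions import Fraction
    from itertools import product

    p_A = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
    p_C = {"H": Fraction(3, 4), "T": Fraction(1, 4)}
    p_D = {"H": Fraction(1, 4), "T": Fraction(3, 4)}

    prob_HH = Fraction(0)
    prob_B_heads = Fraction(0)
    for a, c, d in product("HT", repeat=3):
        weight = p_A[a] * p_C[c] * p_D[d]      # independence of A, C, D
        b = c if a == "H" else d               # the event determining [B = H]
        if b == "H":
            prob_B_heads += weight
            if a == "H":
                prob_HH += weight

    print(prob_HH)       # 3/8, not the 1/4 given by assuming the two tosses independent
    print(prob_B_heads)  # still 1/2

Under this model P[HH] = 3/8, even though each coin separately still shows heads with probability 1/2.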

    2.2 Revision of sample space and events

Remember from ST111 that we can use notation from set theory to describe events. We can think of events as subsets of sample space Ω. If A is an event, then the event that A does not happen is the complement or complementary event A^c = {ω ∈ Ω : ω ∉ A}.

If B is another event then the event that both A and B happen is the intersection A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B}. The event that either A or B (or both!) happen is the union A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B}.

2.3 Algebras of sets

This leads us to identify classes of sets for which we want to find probabilities.

Unless we get lucky! If the intersection of the two sets actually belongs to the family of sets in question then we will have already specified the probability of the intersection ...

    In common usage the word "or" is ambiguous: it can be INCLUSIVE or EXCLUSIVE according to context. Example: Q: Do you want honey or jam on your bread? (intended exclusive or) A: both please! (assumes inclusive or)

    Mathematically "or" is always interpreted as inclusive UNLESS specified otherwise. So "A or B" means, "either A or B or both".

    The exclusive "or" is often signalled by the abbreviation EOR (for "exclusive or") or XOR. It is important in computing and electronic engineering, and in mathematics it is also called the "symmetric complement".

Definition 2.1 (Algebra of sets): An algebra (sometimes called a field) of subsets of Ω is a class C of subsets of a sample space Ω satisfying:

(1) closure under complements: if A ∈ C then A^c ∈ C;
(2) closure under intersections: if A, B ∈ C then A ∩ B ∈ C;
(3) closure under unions: if A, B ∈ C then A ∪ B ∈ C.

Definition 2.2 (Algebra generated by a collection): If C is a collection of subsets of Ω then A(C), the algebra generated by C, is the intersection of all algebras of subsets of Ω which contain C.

    Here are some examples of algebras:

(i) the trivial algebra A = {∅, Ω};

(ii) supposing Ω = {H, T}, another example is

A = { Ω = {H, T}, {H}, {T}, ∅ } ;

(iii) now consider the following class of subsets of the unit interval [0, 1]: A = { finite unions of subintervals }. This is an algebra. For example, if

A = (a_0, a_1) ∪ (a_2, a_3) ∪ ... ∪ (a_{2n}, a_{2n+1})

is a non-overlapping union of intervals (and we can always re-arrange matters so that any union of intervals is non-overlapping!) then

A^c = [0, a_0] ∪ [a_1, a_2] ∪ ... ∪ [a_{2n+1}, 1] .

This checks point (1) of the definition of an algebra of sets. Point (2) is rather easy, and point (3) follows from points (1) and (2).

(iv) Consider A = {{1, 2, 3}, {1, 2}, {3}, ∅}. This is an algebra of subsets of Ω = {1, 2, 3}. Notice it does not include events such as {1}, {2, 3}.

(v) Just to give an example of a collection of sets which is not an algebra, consider {{1, 2, 3}, {1, 2}, {2, 3}, ∅}.

(vi) Algebras get very large. It is typically more convenient simply to give a collection C of sets generating the algebra. For example, if C = ∅ then A(C) = {∅, Ω} is the trivial algebra described above!

(vii) If Ω = {H, T} and C = {{H}} then A(C) = {{H, T}, {H}, {T}, ∅} as in example (ii) above.

(viii) If Ω = [0, 1] and C = { intervals in [0, 1] } then A(C) is the collection of finite unions of intervals as in example (iii) above.

(ix) Finally, if Ω = [0, 1] and C is the collection of one-point sets {x} for x in [0, 1] then A(C) is the collection of (a) all finite sets in [0, 1] and (b) all complements of finite sets in [0, 1].

    It is important to identify the sample space, as this defines the complement operation.

    Exercise: show properties (1) and (2) imply (3), while properties (1) and (3) imply (2).

Explanation of terminology: in algebraic terms an algebra of subsets really is an algebraic algebra, with symmetric difference as "sum" and intersection as "multiplication".

    It is rare to list algebras of sets completely, as we do here. This is because there are normally too many sets in the algebra for this to be convenient.
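For a finite sample space the algebra generated by a collection can be computed mechanically, by closing the collection under the three operations of Definition 2.1. Here is a minimal Python sketch (an illustration; the brute-force closure routine and the example Ω = {1, 2, 3}, C = {{1, 2}} are our choices):

    # Compute the algebra generated by a collection of subsets of a finite sample space,
    # by repeatedly closing under complement, intersection and union (Definition 2.1).
    def generated_algebra(omega, collection):
        omega = frozenset(omega)
        algebra = {frozenset(s) for s in collection} | {frozenset(), omega}
        changed = True
        while changed:
            changed = False
            current = list(algebra)
            for a in current:
                for b in current:
                    for candidate in (omega - a, a & b, a | b):
                        if candidate not in algebra:
                            algebra.add(candidate)
                            changed = True
        return algebra

    omega = {1, 2, 3}
    print(sorted(map(sorted, generated_algebra(omega, [{1, 2}]))))
    # [[], [1, 2], [1, 2, 3], [3]] -- the four-set algebra of example (iv)

Already for modest generating collections the closure grows quickly, which is one reason why algebras are rarely listed in full.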

In realistic examples algebras are rather large: not surprising, since they correspond to the collection of all true-or-false statements you can make about a certain experiment! (If your experiment's results can be summarised as n different yes/no answers such as "result is hot/cold", "result is coloured black/white", etc., then the relevant algebra is composed of 2^{2^n} different subsets!) Therefore it is of interest that the typical element of an algebra can be written down in a rather special form:

Theorem 2.3 (Representation of typical element of algebra): If C is a collection of subsets of Ω then the event A belongs to the algebra A(C) generated by C if and only if

A = ⋃_{i=1}^{N} ⋂_{j=1}^{M_i} C_{i,j}

where for each i, j either C_{i,j} or its complement C_{i,j}^c belongs to C. Moreover we may write A in this form with the sets

D_i = ⋂_{j=1}^{M_i} C_{i,j}

being disjoint. *

We are now in a position to produce our first stab at a set of axioms for probability. Given a sample space Ω and an algebra A of subsets, probability P[·] assigns a number between 0 and 1 to each event in the algebra A, obeying the rules given below. There is a close analogy to the notion of length of subsets of [0, 1] (and also to notions of area, volume, ...): the table below makes this clear:

Probability                              Length of subset of [0, 1]

P[∅] = 0                                 Length(∅) = 0

P[Ω] = 1                                 Length([0, 1]) = 1

P[A ∪ B] = P[A] + P[B]                   Length([a, b] ∪ [c, d]) = Length([a, b]) + Length([c, d])
  if A ∩ B = ∅                             if a ≤ b < c ≤ d

* This result corresponds to a basic remark in logic: logical statements, however complicated, can be reduced to statements of the form (A_1 and A_2 and ... and A_m) or (B_1 and B_2 and ... and B_n) or ... or (C_1 and C_2 and ... and C_p), where the statements A_1 etc. are either basic statements or their negations, and no more than one of the (...) or ... or (...) can be true at once.

There are some consequences of these axioms which are not completely trivial. For example, the law of negation

P[A^c] = 1 − P[A] ;

the generalized law of addition, holding when A ∩ B is not necessarily empty,

P[A ∪ B] = P[A] + P[B] − P[A ∩ B]

(think of double-counting); and finally the inclusion-exclusion law

P[A_1 ∪ A_2 ∪ ... ∪ A_n] = Σ_i P[A_i] − Σ_{i<j} P[A_i ∩ A_j] + ... + (−1)^{n+1} P[A_1 ∩ A_2 ∩ ... ∩ A_n] .
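The inclusion-exclusion law is easy to sanity-check numerically. The short Python sketch below (an illustration; the sample space of 12 equally likely points and the three sets are arbitrary choices) compares the two sides for three events:

    # Check inclusion-exclusion for three events on Omega = {0, ..., 11}, outcomes equally likely.
    from itertools import combinations

    omega = set(range(12))
    A = [{0, 1, 2, 3, 4}, {3, 4, 5, 6}, {0, 4, 6, 7, 8}]
    prob = lambda s: len(s) / len(omega)

    lhs = prob(set.union(*A))
    rhs = 0.0
    for k in range(1, len(A) + 1):
        for subset in combinations(A, k):
            rhs += (-1) ** (k + 1) * prob(set.intersection(*subset))

    print(lhs, rhs)   # both equal 0.75 here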

2.4 Limit Sets

Much of the first half of ST111 is concerned with calculations using these various rules of probabilistic calculation. Essentially the representation theorem above tells us we can compute the probability of any event in A(C) just so long as we know the probabilities of the various events in C and also of all their intersections, whether by knowing events are independent or whether by knowing various conditional probabilities. *

However these calculations can become long-winded and ultimately either infeasible or unrevealing. It is better to know how to approximate probabilities and events, which leads us to the following kind of question:

Suppose we have a sequence of events C_n which are decreasing (getting harder and harder to satisfy) and which converge to a limit C:

C_n ↓ C .

Can we say P[C_n] converges to P[C]?

Here is a specific example. Suppose we observe an infinite sequence of coin tosses, and think therefore of the collection C of events A_i that the i-th coin comes up heads. Consider the probabilities

(a) P[ second toss gives heads ] = P[A_2]

* We avoid discussing conditional probabilities here for reasons of shortage of time: they have been dealt with in ST111 and figure very largely in

    www.warwick.ac.uk/statsdept/teaching/ST202.html

(b) P[ first n tosses all give heads ] = P[ ⋂_{i=1}^{n} A_i ]

(c) P[ the first toss which gives a head is even-numbered ]

There is a difference! The first two can be dealt with within the algebra. The third cannot: suppose C_n is the event "the first toss in numbers 1, ..., n which gives a head is even-numbered, or else all n of these tosses give tails"; then C_n lies in A(C), and converges down to the event C "the first toss which gives a head is even-numbered", but C is not in A(C).

    We now find a number of problems raise their heads.

Problems with everywhere being impossible: Suppose we are running an experiment with an outcome uniformly distributed over [0, 1]. Then we have a problem as mentioned in the second of our motivating examples: under reasonable conditions we are working with the algebra of finite unions of sub-intervals of [0, 1], and the probability measure which gives P[ [a, b] ] = b − a, but this means P[ {a} ] = 0. Now we need to be careful, since if we rashly allow ourselves to work with uncountable unions we get

P[ ⋃_{x∈[0,1]} {x} ] = Σ_{x∈[0,1]} 0 = 0 .

But this contradicts P[ [0, 1] ] = 1 and so is obviously wrong.

Problems with specification: if we react to the above example by insisting we can only give probabilities to events in the original algebra, then we can fail to give probabilities to perfectly sensible events, such as example (c) in the infinite sequence of coin-tosses above. On the other hand if we rashly prescribe probabilities then how can we avoid getting into contradictions such as above?

It seems sensible to suppose that at least when we have C_n ↓ C then we should be allowed to say P[C_n] → P[C], and this turns out to be the case as long as the set-up is sensible. Here is an example of a set-up which is not sensible:

Ω = {1, 2, 3, ...}, C = {{1}, {2}, ...}, P[{n}] = 1/2^{n+1}. Then A(C) is the collection of finite and co-finite* subsets of the positive integers, and

P[ {1, 2, ..., n} ] = Σ_{m=1}^{n} 1/2^{m+1} = 1/2 − 1/2^{n+1} → 1/2 ≠ 1 .

    We must now investigate how we can deal with limit sets.

    * co-finite: complement is finite

2.5 σ-algebras

The first task is to establish a wide range of sensible limit sets. Boldly, we look at sets which can be obtained by any imaginable combination of countable set operations: the collection of all such sets is a σ-algebra.**

Definition 2.4 (σ-algebra): A σ-algebra of subsets of Ω is an algebra which is also closed under countable unions.

In fact σ-algebras are even larger than ordinary algebras; it is difficult to describe a typical member of a σ-algebra, and it pays to talk about σ-algebras generated by specified collections of sets.

Definition 2.5 (σ-algebra generated by a collection): For any collection of subsets C of Ω, we define σ(C) to be the intersection of all σ-algebras of subsets of Ω which contain C:

σ(C) = ⋂ { S : S is a σ-algebra and C ⊆ S } .

Theorem 2.6 (Monotone limits): Note that σ(C) defined above is indeed a σ-algebra: it is the smallest σ-algebra containing C, and in particular it is closed under monotone limits.

Examples of σ-algebras include: all algebras of subsets of finite sets (because then there will be no non-finite countable set operations); the Borel σ-algebra generated by the family of all intervals of the real line; the σ-algebra for the coin-tossing example, generated by the infinite family of events A_i = [ i-th coin is heads ].

2.6 Countable additivity

Now that we have established a context for limit sets (they are sets belonging to a σ-algebra) we can think about what sort of limiting operations we should allow for probability measures.

Definition 2.7 (Measures): A set-function μ : A → [0, ∞] is said to be a finitely-additive measure if it satisfies:

(FA) μ(A ∪ B) = μ(A) + μ(B) whenever A, B are disjoint.

It is said to be countably-additive (or σ-additive) if in addition

(CA) μ( ⋃_{i=1}^{∞} A_i ) = Σ_{i=1}^{∞} μ(A_i) whenever the A_i are disjoint and their union ⋃_{i=1}^{∞} A_i lies in A.

We abbreviate finitely-additive to (FA), countably-additive to (CA). We often abbreviate countably-additive measure to measure. Notice that if A were actually a σ-algebra then we wouldn't have to check the condition that ⋃_{i=1}^{∞} A_i lies in A in the (CA) property.

** σ stands for countable


    We also allow signed measures, taking negative values (e.g. for electrical charge, or difference between two measures).

    However signed measures can be tricky if they can take infinite values, as then we need to be cautious about "infinity minus infinity". So we do not discuss them here.

Definition 2.8 (Probability measures): A set-function P : A → [0, 1] is said to be a finitely-additive probability measure if it is a (FA) measure such that P[Ω] = 1. It is a (CA) probability measure (we often just say probability measure) if in addition it is (CA).

Notice various consequences for probability measures: P[∅] = 0; finite additivity (FA) follows from countable additivity (CA); we always have P( ⋃_{i=1}^{∞} A_i ) ≤ Σ_{i=1}^{∞} P(A_i) even when the union is not disjoint; etc.

(CA) is a kind of continuity condition. A similar continuity condition is that of monotone limits.

Definition 2.9 (Monotone limits): A set-function μ : A → [0, 1] is said to obey the monotone limits property (ML) if it satisfies:

(ML) μ(A_i) ↑ μ(A) whenever the A_i increase upwards to a limit set A which lies in A.

(ML) is simpler to check than (CA) but is equivalent for finitely-additive measures.

Theorem 2.10 (Equivalence for countable additivity):

(FA) + (ML) ⟺ (CA)

Lemma 2.11 (Another equivalence): Suppose P is a finitely additive probability measure on (Ω, F), where F is an algebra of sets. Then P is countably additive if and only if

lim_{n→∞} P[A_n] = 1

whenever the sequence of events A_n belongs to the algebra F and moreover A_n ↑ Ω.

2.7 Uniqueness of probability measures

To illustrate the next step, consider the notion of length/area. (To avoid awkward alternatives, we talk about the measure instead of length/area/volume/...) It is easy to define the area of very regular sets. But for a stranger, more fractal-like, set A we would need to define something like an outer measure

μ*(A) = inf{ Σ_i μ(B_i) : where the B_i cover A }

to get at least an upper bound for what it would be sensible to call the measure of A.

Of course we must give equal priority to considering what is the measure of the complement A^c. Suppose for definiteness that A is contained in a simple set Q of finite measure (a convenient interval for length, a square for area, a cube for volume, ...) so that A^c = Q \ A. Then consideration of μ*(A^c) leads us directly to consideration of inner measure for A:

μ_*(A) = μ(Q) − μ*(A^c) .

Clearly μ_*(A) ≤ μ*(A): moreover we can only expect a truly sensible definition of measure on the set

F = { A : μ_*(A) = μ*(A) } .

The fundamental theorem of measure theory states that this works out all right!

Theorem 2.12 (Extension theorem): If μ is a measure on an algebra A which is σ-additive on A then it can be extended uniquely to a countably additive measure on F defined as above: moreover σ(A) ⊆ F.

The proof of this remarkable theorem is too lengthy to go into here. Notice that it can be paraphrased very simply: if your notion of measure (probability, length, area, volume, ...) can be defined consistently on an algebra in such a way that it is σ-additive whenever the two sides of

μ( ⋃_{i=1}^{∞} A_i ) = Σ_{i=1}^{∞} μ(A_i)

make sense (whenever the disjoint union ⋃_{i=1}^{∞} A_i actually belongs to the algebra), then it can be extended uniquely to the (typically much larger) σ-algebra generated by the original algebra, so as again to be a (σ-additive) measure.

There is an important special part of this theorem which is worth stating separately.

Definition 2.13 (π-system): A π-system of subsets of Ω is a collection of subsets including Ω itself and closed under finite intersections.

Theorem 2.14 (Uniqueness for probability measures): Two finite measures which agree on a π-system P also agree on the generated σ-algebra σ(P).

2.8 Lebesgue measure and coin tossing

The extension theorem can be applied to the uniform probability space Ω = [0, 1], A given by finite unions of intervals, P given by lengths of intervals. It turns out P is indeed σ-additive on A (showing this is non-trivial!) and so the extension theorem tells us there is a unique countably additive extension P on the σ-algebra B = σ(A) (the Borel σ-algebra restricted to [0, 1]). We call this Lebesgue measure.

There is a significant connection between infinite sequences of coin tosses and numbers in [0, 1]. Briefly, we can expand a number x ∈ [0, 1] in binary (as opposed to decimal!): we write x as

.ω_1 ω_2 ω_3 ...

where ω_i equals 1 or 0 according as the fractional part of 2^{i−1} x is at least 1/2 or not. The coin-tossing σ-algebra can be viewed as generated by the sequence

{ω_1, ω_2, ω_3, ...}

with 0 standing for tails, 1 for heads. In effect we get a map from coin-tossing space {0, 1}^ℕ to number space [0, 1], with the slight cautionary note that this map very occasionally maps two sequences onto one number (think of .0111111... and .100000...). In particular

[ω_1 = a_1, ω_2 = a_2, ..., ω_d = a_d] = [x, x + 2^{−d})

where x is the number corresponding to (a_1, a_2, ..., a_d).

Remarkably, we can now use the uniqueness theorem to show that the map T : (a_1, a_2, ...) ↦ x preserves probabilities, in the sense that Lebesgue measure is exactly the same as we get by finding the probability of the event T^{−1}(A) as a coin-tossing event, if the coins are independent and fair.

It is reasonable to ask whether there are any non-measurable sets, since σ-algebras are so big! It is indeed very hard to find any. Here is the basic example, which is due in essence to Vitali.

Consider the following equivalence relation on ([0, 1], B, P): we say x ∼ y if x − y is a rational number. Now construct a set A by choosing exactly one member from each equivalence class.

So for any x ∈ [0, 1] there is one and only one y ∈ A such that x − y is a rational number.

If A were Lebesgue measurable then it would have a value P[A]. What would this value be?

Imagine [0, 1] folded round into a circle. It is the case that P[A] does not change when one turns this circle. In particular we can now consider A_q = {a + q (mod 1) : a ∈ A} for rational q. By construction A_q and A_r are disjoint for different rational q, r. Now we have

⋃_{q rational} A_q = [0, 1]

You should feel uncomfortable about this step. It involves making infinitely many (indeed uncountably many!) choices; profoundly impractical!

    We say, if we allow ourselves to consider this step, that we are assuming the AXIOM OF CHOICE. Briefly, it results in valid conclusions when used in mathematical arguments, but is non-constructive.

and since there are only countably many rational q, and P[A_q] doesn't depend on q, we determine

P[ [0, 1] ] = Σ_{q rational} P[A_q] = Σ_{q rational} P[A] .

But this cannot make sense if P[ [0, 1] ] = 1! We are forced to conclude that A cannot be Lebesgue measurable.

This example has a lot to do with the Banach-Tarski paradox described in one of our motivating examples above.

    3. Independence and measurable functions

3.1 Independence

In ST111 we formalized the idea of independence of events. Essentially we require a multiplication law to hold:

Definition 3.15 (Independence of an infinite sequence of events): We say the events A_i (for i = 1, 2, ...) are independent if, for any finite subsequence i_1 < i_2 < ... < i_k, we have

P[ A_{i_1} ∩ ... ∩ A_{i_k} ] = P[A_{i_1}] × ... × P[A_{i_k}]

Notice we require all possible multiplication laws to hold: it is possible to build interesting examples where events are independent pair-by-pair, but altogether give non-trivial information about each other.

We need to talk about infinite sequences of events (often independent). We often have in the back of our minds a sense that the sequence is revealed to us progressively over time (though this need not be so!), suggesting two natural questions. First, will we see events occur in the sequence right into the indefinite future? Second, will we after some point see all events occur?

Definition 3.16 (Infinitely often and Eventually): Given a sequence of events B_1, B_2, ... we say

B_i holds infinitely often ([B_i i.o.]) if there are infinitely many different i for which the statement B_i is true: in set-theoretic terms

[B_i i.o.] = ⋂_{i=1}^{∞} ⋃_{j=i}^{∞} B_j .


    It has been shown that if we don't assume the Axiom of Choice then we CAN assume that all sets are Lebesgue measurable!

    This isn't as useful as it seems: it means that the difficulties we are dealing with here are profound and have a lot to do with the foundations of mathematics.

B_i holds eventually ([B_i ev.]) if for all large enough i the statement B_i is true: in set-theoretic terms

[B_i ev.] = ⋃_{i=1}^{∞} ⋂_{j=i}^{∞} B_j .

Notice these two concepts ev. and i.o. make sense even if the infinite sequence is just a sequence, with no notion of events occurring consecutively in time!

Notice (you should check this yourself!)

[B_i i.o.] = [B_i^c ev.]^c .

3.2 Borel-Cantelli lemmas

The multiplication laws appearing above in Section 2.1 force a kind of infinite multiplication law.

Lemma 3.17 (Probability of infinite intersection): If the events A_i (for i = 1, 2, ...) are independent then

P[ ⋂_{i=1}^{∞} A_i ] = Π_{i=1}^{∞} P[A_i]

We have to be careful what we mean by the infinite product Π_{i=1}^{∞} P[A_i]: we mean of course the limiting value lim_{n→∞} Π_{i=1}^{n} P[A_i].

We can now prove a remarkable pair of facts about P[A_i i.o.] (and hence its twin P[A_i ev.]!). It turns out it is often easy to tell whether these events have probability 0 or 1.

Theorem 3.18 (Borel-Cantelli lemmas): Suppose the events A_i (for i = 1, 2, ...) form an infinite sequence. Then

(i) if Σ_{i=1}^{∞} P[A_i] < ∞ then P[A_i i.o.] = 0;

(ii) if the A_i are independent and Σ_{i=1}^{∞} P[A_i] = ∞ then P[A_i i.o.] = 1.
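The two cases can be explored by simulation. In the minimal Python sketch below (an illustration; the horizon and the two example choices P[A_i] = 1/i^2 and P[A_i] = 1/(2i) are ours), independent events with summable probabilities stop occurring early in a typical run, while events with a divergent probability sum keep occurring.

    # Borel-Cantelli by simulation: independent events A_i with P[A_i] = p(i).
    # Summable p(i) (e.g. 1/i^2): only finitely many A_i occur, almost surely.
    # Non-summable p(i) (e.g. 1/(2i)): infinitely many A_i occur, almost surely.
    import random

    def last_occurrence(p, horizon=100_000):
        """Index of the last event A_i (i <= horizon) that occurs in one simulated run."""
        last = 0
        for i in range(1, horizon + 1):
            if random.random() < p(i):
                last = i
        return last

    random.seed(1)
    print(last_occurrence(lambda i: 1.0 / i ** 2))   # typically a small index: the events die out
    print(last_occurrence(lambda i: 1.0 / (2 * i)))  # typically a large index: the events keep occurring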

3.3 Law of large numbers for events

Theorem 3.19 (Law of large numbers for events): Suppose that we have a sequence of independent events A_i each with the same probability p. Let S_n count the number of events A_1, ..., A_n which occur. Then

P[ |S_n/n − p| ≤ ε ev. ] = 1

for all positive ε.

3.4 Independence and classes of events

The idea of independence stretches beyond mere sequences of events. For example, consider (a) a set of events concerning a football match between Coventry City and Aston Villa, at home for Coventry, and (b) a set of events concerning a cricket test between England and Australia at Melbourne, both happening on the same day. At least as a first approximation, one might assume that any combination of events concerning (a) is independent of any combination concerning (b).

Definition 3.20 (Independence and classes of events): Suppose C_1, C_2 are two classes of events. We say they are independent if A and B are independent whenever A ∈ C_1, B ∈ C_2.

Here our notion of π-systems becomes important.

Lemma 3.21 (Independence and π-systems): If two π-systems are independent, then so are the σ-algebras they generate.

Returning to sequences, the above is the reason why we can jump immediately from assumptions of independence of events to deducing that their complements are independent.

Corollary 3.22 (Independence and complements): If a sequence of events A_i is independent, then so is the sequence of complementary events A_i^c.

3.5 Measurable functions

Mathematical work often becomes easier if one moves from sets to functions. Probability theory is no different. Instead of events (subsets of sample space) we can often find it easier to work with random variables (real-valued functions defined on sample space). You should think of a random variable as involving lots of different events, namely those events defined in terms of the random variable taking on different sets of values. Accordingly we need to take care that the random variable doesn't produce events which fall outwith our chosen σ-algebra. To do this we need to develop the idea of a measurable function.

Definition 3.23 (Measurable space): (Ω, F) is a measurable space if F is a σ-algebra of subsets of Ω.

Definition 3.24 (Borel σ-algebra): The Borel σ-algebra B is the σ-algebra of subsets of R generated by the collection of intervals of R.

In fact we don't need all the intervals of R. It is enough to take the closed half-infinite intervals (−∞, x].

Definition 3.25 (Measurable function): Suppose that (Ω, F), (Ω′, F′) are both measurable spaces. We say the function

f : Ω → Ω′

is measurable if f^{−1}(A) = {ω ∈ Ω : f(ω) ∈ A} belongs to F whenever A belongs to F′.

Definition 3.26 (Random variable): Suppose that X : Ω → R is measurable as a mapping from (Ω, F) to (R, B). Then we say X is a random variable.

As we have said, to each random variable there is a class of related events. This actually forms a σ-algebra.

Definition 3.27 (σ-algebra generated by a random variable): If X : Ω → R is a random variable then the σ-algebra generated by X is the family of events σ(X) = {X^{−1}(A) : A ∈ B}.
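On a finite sample space σ(X) is simply the collection of all unions of the level sets {X = c}, and can be listed directly. A minimal Python sketch (an illustration; the die-roll space and the parity variable are arbitrary choices):

    # sigma(X) on a finite sample space: all unions of the level sets {X = c}.
    from itertools import combinations

    omega = {1, 2, 3, 4, 5, 6}                      # e.g. one roll of a die
    X = {w: w % 2 for w in omega}                   # X = parity of the outcome

    levels = {}
    for w, value in X.items():
        levels.setdefault(value, set()).add(w)
    level_sets = list(levels.values())

    sigma_X = set()
    for k in range(len(level_sets) + 1):
        for choice in combinations(level_sets, k):
            union = set().union(*choice) if choice else set()
            sigma_X.add(frozenset(union))

    print(sorted(map(sorted, sigma_X)))
    # [[], [1, 2, 3, 4, 5, 6], [1, 3, 5], [2, 4, 6]]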

3.6 Independence of random variables

Random variables can be independent too! Essentially here independence means that an event generated by one of the random variables cannot be used to give useful predictions about an event generated by the other random variable.

Definition 3.28 (Independence of random variables): We say random variables X and Y are independent if their σ-algebras σ(X), σ(Y) are independent.

Theorem 3.29 (Criterion for independence of random variables): Let X and Y be random variables, and let P be the π-system of R formed by all half-infinite closed intervals (−∞, x]. Then X and Y are independent if and only if the collections of events X^{−1}P, Y^{−1}P are independent*.

3.7 Distributions of random variables

We often need to talk about random variables on their own, without reference to other random variables or events. In such cases all we are interested in is the probabilities they have of taking values in various regions:

* Here we define X^{−1}P = {X^{−1}(A) : A ∈ P} = {X^{−1}((−∞, x]) : x ∈ R}

Definition 3.30 (Distribution of a random variable): Suppose that X is a random variable. Its distribution is the probability measure P_X on R given by

P_X[B] = P[X ∈ B]

whenever B ∈ B.

4. Integration

One of the main things to do with functions is to integrate them (find the area under the curve). One of the main things to do with random variables is to take their expectations (find their average values). It turns out that these are really the same idea! We start with integration.

4.1 Simple functions and Indicators

Begin by thinking of the simplest possible function to integrate. That is an indicator function, which only takes two possible values, 0 or 1:

Definition 4.31 (Indicator function): If A is a measurable set then its indicator function is defined by

I[A](x) = 0 if x ∉ A,  and I[A](x) = 1 if x ∈ A.

The next stage up is to consider a simple function taking only a finite number of values, since it can be regarded as a linear combination of indicator functions.

Definition 4.32 (Simple functions): A simple function h is a measurable function h : Ω → R which only takes finitely many values. Thus we can represent it as

h(x) = c_1 I[A_1](x) + ... + c_n I[A_n](x)

for some finite collection A_1, ..., A_n of measurable sets and constants c_1, ..., c_n.

    It is easy to integrate simple functions ...

Definition 4.33 (Integration of simple functions): The integral of a simple function h with respect to a measure μ is given by

∫ h dμ = ∫ h(x) μ(dx) = Σ_{i=1}^{n} c_i μ(A_i)

where

h(x) = c_1 I[A_1](x) + ... + c_n I[A_n](x)

as above.
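For a discrete measure this definition is immediately computable. The minimal Python sketch below (an illustration; the three-point measure, the simple function and the helper names are our choices) integrates h = 5·I[{1,2}] + 2·I[{3}] by forming Σ_i c_i μ(A_i).

    # Integrate a simple function h = sum_i c_i I[A_i] against a measure mu on a finite space.
    from fractions import Fraction

    mu = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}   # a probability measure on {1, 2, 3}

    def measure(A):
        return sum(mu[w] for w in A)

    def integrate_simple(terms):
        """terms is a list of pairs (c_i, A_i); returns sum_i c_i * mu(A_i)."""
        return sum(c * measure(A) for c, A in terms)

    # h = 5 * I[{1, 2}] + 2 * I[{3}]
    print(integrate_simple([(5, {1, 2}), (2, {3})]))   # 5*(3/4) + 2*(1/4) = 17/4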

Note that one really should prove that the definition of ∫ h dμ does not depend on exactly how one represents h as the sum of indicator functions.

Integration for such functions has a number of basic properties which one uses all the time, almost unconsciously, when trying to find integrals.

Theorem 4.34 (Properties of integration for simple functions):

(1) if μ(f ≠ g) = 0 then ∫ f dμ = ∫ g dμ;

(2) Linearity: ∫ (af + bg) dμ = a ∫ f dμ + b ∫ g dμ;

(3) Monotonicity: f ≤ g means ∫ f dμ ≤ ∫ g dμ;

(4) min{f, g} and max{f, g} are simple.

Simple functions are rather boring. For more general functions we use limiting arguments. We have to be a little careful here, since some functions will have integrals built up from +∞ where they are integrated over one part of the region, and −∞ over another part. Think for example of

∫_{−∞}^{∞} (1/x) dx = ∫_{−∞}^{0} (1/x) dx + ∫_{0}^{∞} (1/x) dx equals ?

    So we first consider just non-negative functions.

Definition 4.35 (Integration for non-negative measurable functions): If f ≥ 0 is measurable then we define

∫ f dμ = sup{ ∫ g dμ : for simple g such that 0 ≤ g ≤ f } .

4.2 Integrable functions

For general functions we require that we don't get into this situation of ∞ − ∞.

Definition 4.36 (Integration for general measurable functions): If f is measurable and we can write f = g − h for two non-negative measurable functions g and h, both with finite integrals, then

∫ f dμ = ∫ g dμ − ∫ h dμ .

We then say f is integrable.

One really needs to prove that the integral ∫ f dμ does not depend on the choice f = g − h. In fact if there is any choice which works then the easy choice

g = max{f, 0},  h = max{−f, 0}

will work.

One can show that the integral on integrable functions agrees with its definition on simple functions and is linear. What starts to make the theory very easy is that the integral thus defined behaves very well when studying limits.

Theorem 4.37 (Monotone convergence theorem (MON)): If f_n ↑ f (all being non-negative measurable functions) then

∫ f_n dμ ↑ ∫ f dμ .

Corollary 4.38 (Integrability and simple functions): if f is non-negative and measurable then for any sequence of non-negative simple functions f_n such that f_n ↑ f we have

∫ f_n dμ ↑ ∫ f dμ .

Definition 4.39 (Integration over a measurable set): if A is measurable and f is integrable then

∫_A f dμ = ∫ ( I[A] f ) dμ .

4.3 Expectation of random variables

The above notions apply directly to random variables, which may be thought of simply as measurable functions defined on the sample space!

Definition 4.40 (Expectation): if P is a probability measure then we define expectation (with respect to this probability measure) for all integrable random variables X by

E[X] = ∫ X dP = ∫ X(ω) P(dω) .

The notion of expectation is really only to do with the random variable considered on its own, without reference to any other random variables. Accordingly it can be expressed in terms of the distribution of the random variable.

The result MON is no longer true if the functions f_n are not bounded below. EXAMPLE: f_n(x) is minus infinity if x > n, zero if x ≤ n.

Theorem 4.41 (Change of variables): Let X be a random variable and let g : R → R be a measurable function. Assuming that the random variable g(X) is integrable,

E[ g(X) ] = ∫_R g(x) P_X(dx) .

4.4 Examples

You need to work through examples such as the following to get a good idea of how the above really works out in practice. See the material covered in lectures for more on this.

Evaluate ∫_0^1 x Leb(dx) = 1/2.

Consider Ω = {1, 2, 3, ...}, P[{i}] = p_i where Σ_{i=1}^{∞} p_i = 1. Evaluate ∫ f dP = Σ_{i=1}^{∞} f(i) p_i.

Evaluate ∫_0^y e^{−x} Leb(dx).

Evaluate ∫_0^n f(x) Leb(dx) where

f(x) = 1 if 0 ≤ x < 1,  2 if 1 ≤ x < 2,  ...,  n if n − 1 ≤ x < n.

Evaluate ∫ I[0, π](x) sin(x) Leb(dx).

5. Convergence

Approximation is a fundamental key to making mathematics work in practice. Instead of being stuck, unable to do a hard problem, we find an easier problem which has almost the same answer, and do that instead! The notion of convergence (see first-year analysis) is the formal structure giving us the tools to do this. For random variables there are a number of different notions of convergence, depending on whether we need to approximate a whole sequence of actual random values, or just a particular random value, or even just probabilities.

    5.1 Convergence of random variables

Definition 5.42 (Convergence in probability): The random variables X_n converge in probability to Y,

X_n → Y in prob ,

if for all positive ε we have

P[ |X_n − Y| > ε ] → 0 .

Definition 5.43 (Convergence almost surely / almost everywhere): The random variables X_n converge almost surely to Y,

X_n → Y a.s. ,

if we have

P[ X_n ↛ Y ] = 0 .

The (measurable) functions f_n converge almost everywhere to f if the set

{ x : f_n(x) → f(x) fails }

is of Lebesgue measure zero.

The difference is that convergence in probability deals with just a single random value X_n for large n. Convergence almost surely deals with the behaviour of the whole sequence. Here are some examples to think about.

Consider random variables defined on ([0, 1], B, Leb) by X_n(ω) = I[[0, 1/n]](ω). Then X_n → 0 a.s.

Consider the probability space above and the events A_1 = [0, 1], A_2 = [0, 1/2], A_3 = [1/2, 1], A_4 = [0, 1/4], ..., A_7 = [3/4, 1], ... Then X_n = I[A_n] converges to zero in probability but not almost surely (see the sketch after these examples).

Suppose in the above that X_n = Σ_{k=1}^{n} (k/n) I[[(k−1)/n, k/n]]. Then X_n → X a.s., where X(ω) = ω for ω ∈ [0, 1].

Suppose in the above that X_n ≤ a for all n. Let Y_n = max_{m≤n} X_m. Then Y_n → Y a.s. for some Y.

Suppose in the above that the X_n are not bounded, but are independent, and furthermore

lim_{a→∞} Π_{n=1}^{∞} P[X_n ≤ a] = 1 .

Then Y_n → Y a.s. where

P[Y ≤ a] = Π_{n=1}^{∞} P[X_n ≤ a] .

As one might expect, the notion of almost sure convergence implies that of convergence in probability.

It is useful to bear in mind that the difference between convergence almost surely and convergence in probability is often explored using the Borel-Cantelli lemmas.

Theorem 5.44 (Almost sure convergence implies convergence in probability): X_n → X a.s. implies X_n → X in prob.

Almost sure convergence allows for various theorems telling us when it is OK to exchange integrals and limits. Generally this doesn't work: consider the example

1 = ∫_0^∞ n exp(−nt) dt ≠ ∫_0^∞ lim_{n→∞} n exp(−nt) dt = ∫_0^∞ 0 dt = 0 .

However we have already seen one case where it does work: when the limit is monotonic. In fact we only need this to hold almost everywhere (i.e. when the convergence is almost sure).

Theorem 5.45 (MON): if the functions f_n, f are non-negative and if f_n ↑ f a.e. then

∫ f_n dμ ↑ ∫ f dμ .

It is often the case that the following simple inequalities are crucial to figuring out whether convergence holds.

Lemma 5.46 (Markov's inequality): if f : R → R is increasing and non-negative and X is a random variable then

P[ X ≥ a ] ≤ E[ f(X) ] / f(a)

for all a such that f(a) > 0.

Corollary 5.47 (Chebyshev's inequality): if E[X^2] < ∞ then

P[ |X − E[X]| ≥ a ] ≤ Var(X) / a^2

for all a > 0.
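A quick numerical illustration of Chebyshev's inequality (a sketch; the uniform distribution on [0, 1] and the threshold a = 0.4 are arbitrary choices): the empirical tail probability stays below the bound Var(X)/a^2.

    # Compare the empirical value of P[ |X - E[X]| >= a ] with the Chebyshev bound Var(X)/a^2
    # for X uniform on [0, 1]: E[X] = 1/2, Var(X) = 1/12.
    import random

    random.seed(3)
    trials, a = 200_000, 0.4
    mean, var = 0.5, 1.0 / 12.0

    tail = sum(1 for _ in range(trials) if abs(random.random() - mean) >= a) / trials
    print(tail)            # exact value is 0.2
    print(var / a ** 2)    # Chebyshev bound: (1/12)/0.16, roughly 0.52, comfortably larger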

In particular we can get a lot of mileage by combining this with the fact that, while in general variance is not additive, it is additive in the case of independence.

Lemma 5.48 (Variance and independence): if a sequence of random variables X_i is independent then

Var( Σ_{i=1}^{n} X_i ) = Σ_{i=1}^{n} Var(X_i) .

5.2 Laws of large numbers for random variables

An important application of these ideas is to show that the law of large numbers extends from events to random variables.

Theorem 5.49 (Weak law of large numbers): if a sequence of random variables X_i is independent, and if the random variables all have the same finite mean and variance E[X_i] = μ and Var(X_i) = σ^2, then, writing S_n = X_1 + ... + X_n, for all ε > 0

P[ |S_n/n − μ| > ε ] → 0  as n → ∞ .
Corollary 5.53 (Dominated convergence theorem (DOM)): If the functions f_n : R → R are bounded above in absolute value by g a.e. (so |f_n| < g a.e.), g is integrable, and also f_n → f a.e., then

lim ∫ f_n dμ = ∫ f dμ .

    This is a very powerful result ...

    5.5 Examples

If the X_n form a bounded sequence of random variables and they converge almost surely to X then

E[X_n] → E[X] .

Suppose that U is a random variable uniformly distributed over [0, 1] and

X_n = Σ_{k=0}^{2^n − 1} (k/2^n) I[ k/2^n ≤ U < (k+1)/2^n ] .

Then X_n ↑ U, and so E[X_n] → E[U].
6. Product measures

6.1 Product measure spaces

Definition 6.54 (Product measure space): define the product measure μ × ν on the π-system R of rectangle sets A × B as above, so that (μ × ν)(A × B) = μ(A) ν(B).

Let A(R) be the algebra generated by R.

Lemma 6.55 (Representation of A(R)): every member of A(R) can be expressed as a finite disjoint union of rectangle sets.

It is now possible to apply the Extension Theorem (we need to check σ-additivity; this is non-trivial but works) to define the product measure μ × ν on the whole σ-algebra σ(R).

6.2 Fubini's theorem

There are three big results on integration. We have already met two: MON and DOM, which tell us cases when we can exchange integrals and limits. The other result arises in the situation where we have a product measure space. In such a case we can integrate any function in one of three possible ways: either using the product measure, or by first doing a partial integration holding one coordinate fixed, and then integrating with respect to that one. We call this alternative iterated integration, and obviously there are two ways to do it depending on which variable we fix first. The final big result is due to Fubini, and tells us that as long as the function is modestly well-behaved it doesn't matter which of the three ways we do the integration, we still get the same answer:

Theorem 6.56 (Fubini's theorem): Suppose f is a real-valued function defined on the product measure space above which is either (a) non-negative or (b) μ × ν-integrable. Then

∫ f d(μ × ν) = ∫ ( ∫ f(x, y) ν(dy) ) μ(dx) = ∫ ( ∫ f(x, y) μ(dx) ) ν(dy) .

Notice the two alternative conditions. Non-negativity (sometimes described as Tonelli's condition) is easy to check but can be limited. Think carefully about Fubini's theorem and especially Tonelli's condition, and you will see that the only thing which can go wrong is when in the product form you have an ∞ − ∞ problem!
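For counting measures on finite sets, Fubini's theorem reduces to the familiar fact that a finite double sum may be evaluated in either order or all at once. The minimal Python sketch below (an illustration; the index sets and the non-negative function are arbitrary choices) checks that the three evaluations agree; with f ≥ 0 the ∞ − ∞ danger cannot arise.

    # Discrete Fubini/Tonelli check: for a non-negative f on a product of finite sets,
    # the double "integral" (here a double sum against counting measures) does not depend
    # on the order of summation.
    X = range(1, 5)
    Y = range(1, 7)
    f = lambda x, y: x * x + y          # a non-negative function on X x Y

    joint   = sum(f(x, y) for x in X for y in Y)
    x_first = sum(sum(f(x, y) for y in Y) for x in X)
    y_first = sum(sum(f(x, y) for x in X) for y in Y)

    print(joint, x_first, y_first)      # all three agree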

6.3 Relationship with independence

Suppose X and Y are independent random variables. Then the distribution of the pair (X, Y), the measure on R × R given by

μ(A) = P[ (X, Y) ∈ A ] ,

is exactly the product measure P_X × P_Y, where P_X is the distribution of X and P_Y is the distribution of Y.

    End of outline notes

Contents

1. Introduction
   1.1 Books
   1.2 Resources (including examination information)
   1.3 Motivating Examples

2. Probabilities, algebras, and σ-algebras
   2.1 Motivation
   2.2 Revision of sample space and events
   2.3 Algebras of sets
   2.4 Limit Sets
   2.5 σ-algebras
   2.6 Countable additivity
   2.7 Uniqueness of probability measures
   2.8 Lebesgue measure and coin tossing

3. Independence and measurable functions
   3.1 Independence
   3.2 Borel-Cantelli lemmas
   3.3 Law of large numbers for events
   3.4 Independence and classes of events
   3.5 Measurable functions
   3.6 Independence of random variables
   3.7 Distributions of random variables

4. Integration
   4.1 Simple functions and Indicators
   4.2 Integrable functions
   4.3 Expectation of random variables
   4.4 Examples

5. Convergence
   5.1 Convergence of random variables
   5.2 Laws of large numbers for random variables
   5.3 Convergence of integrals and expectations
   5.4 Dominated convergence theorem
   5.5 Examples

6. Product measures
   6.1 Product measure spaces
   6.2 Fubini's theorem
   6.3 Relationship with independence