
INTRODUCTION TO PROBABILITY THEORY AND STATISTICS

HEINRICH MATZINGER
Georgia Tech
E-mail: [email protected]

October 7, 2014

Contents

1 Definition and basic properties
  1.1 Events
  1.2 Frequencies
  1.3 Definition of probability
  1.4 Direct consequences
  1.5 Some inequalities
2 Conditional probability and independence
  2.1 Law of total probability
  2.2 Bayes' rule
3 Expectation
4 Dispersion, average fluctuation and standard deviation
  4.1 Matzinger's rule of thumb
5 Calculation with the variance
  5.1 Getting the big picture with the help of Matzinger's rule of thumb
6 Covariance and correlation
  6.1 Correlation
7 Chebyshev's and Markov's inequalities
8 Combinatorics
9 Important discrete random variables
  9.1 Bernoulli variable
  9.2 Binomial random variable
  9.3 Geometric random variable
10 Continuous random variables
11 Normal random variables
12 Distribution functions
13 Expectation and variance for continuous random variables
14 Central limit theorem
15 Statistical testing
  15.1 Looking up probabilities for the standard normal in a table
  15.2 Two sample testing
16 Statistical estimation
  16.1 An example
  16.2 Estimation of variance and standard deviation
  16.3 Maximum likelihood estimation
  16.4 Estimation of parameter for geometric random variables
17 Linear regression
  17.1 The case where the exact linear model is known
  17.2 When α and β are not known
  17.3 Where the formulas for the estimates of α and β come from
  17.4 Expectation and variance of the estimates
  17.5 How precise are our estimates
  17.6 Multiple factors and/or polynomial regression
  17.7 Other applications

    1 Definition and basic properties

    1.1 Events

Imagine that we throw a die which has 4 sides. The outcome of this experiment will be one of the four numbers: 1, 2, 3 or 4. The set of all possible outcomes in this case is:

Ω = {1, 2, 3, 4}.

Ω is called the outcome space or sample space. Before doing the experiment we don't know what the outcome will be. Each possible outcome has a certain probability to occur. This die-experiment is a random experiment.
We can use our die to make bets. Somebody might bet that the number will be even. We throw the die: if the number we see is 2 or 4, we say that the event "even" has occurred or has been observed. We can identify the event "even" with the set {2, 4}. This might seem a little bit abstract, but by identifying the event with a set, events become easier to handle: sets are well-known mathematical objects, whilst events as we know them from everyday language are not.
In a similar way one might bet that the outcome is a number greater than or equal to 3. This event is realized when we observe a 3 or a 4. The event "greater or equal 3" can thus be viewed as the set {3, 4}. Another example is the event "odd". This is the set {1, 3}.
With this way of looking at things, events are simply subsets of Ω. Take another example: a coin with a side 0 and a side 1. The outcome space or sample space in that case is:

Ω = {0, 1}.

The events are the subsets of Ω; in this case there are 4 of them:

∅, {0}, {1}, {0, 1}.

Example 1.1 It might at first seem very surprising that events can be viewed as sets. Consider for example the following sets: the set of bicycles which belong to a Georgia Tech student, the set of all sky-scrapers in Atlanta, the set of all one dollar bills which are currently in the US.
Let us give a couple of events: the event that after X-mas the unemployment rate is lower than now, the event that our favorite pet dies from a heart attack, the event that I go down with flu next week.
At first, it seems that events are something very different from sets. Let us see in a real-world example how mathematicians view events as sets.
Assume that we are interested in where the American economy is going to stand exactly one year from now. More specifically, we look at unemployment and inflation and wonder if they will be above or below their current level. To describe the situation which we encounter in a year from now, we introduce a two-digit variable Z = XY. Let X be equal to one if unemployment is higher in a year than its current level. If it is lower, let X be equal to 0. Similarly, let Y be equal to one if inflation is higher in a year from now. If it is lower, let Y be equal to zero. The possible outcomes for Z are:

00, 01, 10, 11.

This is the situation of a random experiment, where the outcome is one of the four possible numbers: 00, 01, 10, 11. We don't know what the outcome will be. But each possibility can occur with a certain probability. Let A be the event that unemployment is higher in a year. This corresponds to the outcomes 10 and 11. We thus identify the event A with the set:

{10, 11}.


Let B be the event that inflation is higher in a year from now. This corresponds to the outcomes 01 and 11. We thus view the event B as the set:

{01, 11}.

Recall that the intersection A ∩ B of two sets A and B is the set consisting of all elements contained in both A and B. In our example, the intersection of A and B is equal to A ∩ B = {11}. Let C designate the event that unemployment goes up and that inflation goes up at the same time. This corresponds to the outcome 11. Thus, C is identified with the set {11}. In other words, C = A ∩ B. The general rule which we must remember is:

For any events A and B, if C designates the event that A and B both occur at the same time, then C = A ∩ B.

Let D be the event that unemployment or inflation will be up in a year from now. (By "or" we mean that at least one of them is up.) This corresponds to the outcomes 01, 10, 11. Thus D gets identified with the set:

D = {01, 10, 11}.

Recall that the union A ∪ B of two sets A and B is defined to be the set consisting of all elements which are in A or in B. We see in our example that D = A ∪ B. This is true in general. We must thus remember the following rule:

For any events A and B, if D designates the event that A or B occurs, then D = A ∪ B.

    1.2 Frequencies

Assume that we have a six-sided die. In this case the outcome space is

Ω = {1, 2, 3, 4, 5, 6}.

The event "even" in this case is the set

{2, 4, 6},

whilst the event "odd" is equal to

{1, 3, 5}.

Instead of throwing the die only once, we throw it several times. As a result, instead of just a number, we get a sequence of numbers. When throwing the six-sided die I obtained the sequence:

1, 4, 3, 5, 2, 6, 3, 4, 5, 3, . . .

When repeating the same experiment, which consists in throwing the die a couple of times, we are likely to obtain another sequence. The sequence we observe is a random sequence.


In this example we observe one 3 within the first 5 trials and three 3s within the first 10 trials. We write

n{3}

for the number of times we observe a 3 among the first n trials. In our example thus: for n = 5 we have n{3} = 1, whilst for n = 10 we find n{3} = 3.
Let A be an event. We denote by nA the number of times A occurred up to time n. Take for example A to be the event "even". In the above sequence, within the first 5 trials we obtained 2 even numbers. Thus for n = 5 we have that nA = 2. Within the first 10 trials we found 4 even numbers. Thus, for n = 10 we have nA = 4. The proportion of even numbers nA/n for the first 5 trials is equal to 2/5 = 40%. For the first 10 trials, this proportion is 4/10 = 40%.

    1.3 Definition of probability

The basic definition of probability which we use is based on frequencies. For our definition of probability we need an assumption about the world surrounding us.
Let A designate an event. When we repeat the same random experiment independently many times, we observe that in the long run the proportion of times A occurs tends to stabilize. Whenever we repeat this experiment, the proportion nA/n in the long run tends to be the same number. A more mathematical way of formulating this is to say that nA/n converges to a number only depending on A, as n tends to infinity. This is our basic assumption.

Assumption: As we keep repeating the same random experiment under the same conditions and such that each trial is independent of the previous ones, we find that the proportion nA/n tends to a number which only depends on A, as n → ∞.

We are now ready to give our definition of probability:

Definition 1.1 Let A be an event. Assume that we repeat the same random experiment under exactly the same conditions independently many times. Let nA designate the number of times the event A occurred within the n first repeats of the experiment. We define the probability of the event A to be the real number:

P(A) := lim_{n→∞} nA/n.

Thus, P(A) designates the probability of the event A. Take for example a four-sided perfectly symmetric die. Because of symmetry, each side must have the same probability. In the long run we will see a fourth of the times a 1, a fourth of the times a 2, a fourth of the times a 3 and a fourth of the times a 4. Thus, for the symmetric die the probability of each side is 0.25.


1.4 Direct consequences

From our definition of probability there are several useful facts which follow immediately:

1. For any event A, we have that P(A) ≥ 0.

2. For any event A, we have that P(A) ≤ 1.

3. Let Ω designate the state space. Then P(Ω) = 1.

Let us prove these elementary facts:

1. By definition nA/n ≥ 0. However, the limit of a sequence which is ≥ 0 is also ≥ 0. Since P(A) is by definition equal to the limit of the sequence nA/n, we find that P(A) ≥ 0.

2. By definition nA ≤ n. It follows that nA/n ≤ 1. The limit of a sequence which is always less than or equal to one must also be less than or equal to one. Thus, P(A) = lim_{n→∞} nA/n ≤ 1.

3. By definition nΩ = n. Thus:

P(Ω) = lim_{n→∞} nΩ/n = lim_{n→∞} n/n = lim_{n→∞} 1 = 1.

The next two theorems are essential for solving many problems:

Theorem 1.1 Let A and B be disjoint events. Then:

P(A ∪ B) = P(A) + P(B).

Proof. Let C be the event C = A ∪ B. C is the event that A or B has occurred. Because A and B are disjoint, we have that A and B cannot occur at the same time. Thus, when we count up to time n how many times C has occurred, we find that this is exactly equal to the number of times A has occurred plus the number of times B has occurred. In other words,

nC = nA + nB. (1.1)

From this it follows that:

P(C) = lim_{n→∞} nC/n = lim_{n→∞} (nA + nB)/n = lim_{n→∞} (nA/n + nB/n).

We know that the sum of limits is equal to the limit of the sum. Applying this to the right side of the last equality above yields:

lim_{n→∞} (nA/n + nB/n) = lim_{n→∞} nA/n + lim_{n→∞} nB/n = P(A) + P(B).

This finishes the proof that

P(C) = P(A ∪ B) = P(A) + P(B).

Let us give an example which might help us to understand why equation 1.1 holds. Imagine we are using a 6-sided die. Let A be the event that we observe a 2 or a 3. Thus A = {2, 3}. Let B be the event that we observe a 1 or a 5. Thus, B = {1, 5}. The two events A and B are disjoint: it is not possible to observe A and B at the same time, since A ∩ B = ∅. Assume that we throw the die 10 times and obtain the sequence of numbers:

1, 3, 4, 6, 3, 4, 2, 5, 1, 2.

We have seen the event A four times: at the second, fifth, seventh and tenth trials. The event B is observed at the first, eighth and ninth trials. C = A ∪ B = {2, 3, 1, 5} is observed at trials number 2, 5, 7, 10 and 1, 8, 9. We thus find in this case that nA = 4, nB = 3 and nC = 7, which confirms equation 1.1.

Example 1.2 Assume that we are flipping a fair coin with sides 0 and 1. Let Xi designate the number which we obtain when we flip the coin for the i-th time. Let A be the event that we observe right at the beginning the sequence 111. In other words:

A = {X1 = 1, X2 = 1, X3 = 1}.

Let B designate the event that we observe the sequence 101 when we read our random sequence starting from the second trial. Thus:

B = {X2 = 1, X3 = 0, X4 = 1}.

Assume that we want to calculate the probability to observe that at least one of the two events A or B holds. In other words, we want to calculate the probability of the event C = A ∪ B.
Note that A and B cannot both occur at the same time. The reason is that for A to hold it is necessary that X3 = 1, and for B to hold it is necessary that X3 = 0. X3, however, cannot be equal to 0 and to 1 at the same time. Thus, A and B are disjoint events, so we are allowed to use theorem 1.1. We find, applying theorem 1.1, that:

P(A ∪ B) = P(A) + P(B).

With a fair coin, each 3-digit sequence has the same probability. There are 8 such 3-digit sequences, so each one has probability 1/8. It follows that P(A) = 1/8 and P(B) = 1/8. Thus

P(A ∪ B) = 1/8 + 1/8 = 1/4 = 25%.


The next theorem is useful for any pair of events A and B, not just disjoint events:

Theorem 1.2 Let A and B be two events. Then:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Proof. Let C = A ∪ B. Let D = B \ A, that is, D consists of all the elements that are in B but not in A. We have by definition that C = D ∪ A and that D and A are disjoint. Thus we can apply theorem 1.1 and find:

P(C) = P(A) + P(D). (1.2)

Furthermore, (A ∩ B) and D are disjoint, and we have B = (A ∩ B) ∪ D. We can thus apply theorem 1.1 and find that:

P(B) = P(A ∩ B) + P(D). (1.3)

Subtracting equation 1.3 from equation 1.2 yields:

P(C) − P(B) = P(A) − P(A ∩ B).

By adding P(B) on both sides of the last equation, we find:

P(C) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

This finishes the proof.

Problem 1.1 Let a and b designate two genes. Let the probability that a randomly picked person in the US has gene a be 20%. Let the probability for gene b be 30%. And finally, let the probability that the person has both genes at the same time be 10%. What is the probability of having at least one of the two genes?

Let us explain how we solve the above problem: Let A, resp. B, designate the event that the randomly picked person has gene a, resp. b. We know that:

P(A) = 20%, P(B) = 30%, P(A ∩ B) = 10%.

The event of having at least one gene is the event A ∪ B. By theorem 1.2 we have that P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Thus in our case: P(A ∪ B) = 20% + 30% − 10% = 40%. This solves the above problem.
In many situations we will be considering the union of three or more events. The next theorem gives the formula for three events:


Theorem 1.3 Let A, B and C be three events. Then we have

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Proof. We already know the formula for the probability of the union of two events, so we are going to use this formula. Let D denote the union D := B ∪ C. Then we find

A ∪ B ∪ C = A ∪ D

and hence

P(A ∪ B ∪ C) = P(A ∪ D). (1.4)

By theorem 1.2, the right side of the last equation above is equal to:

P(A ∪ D) = P(A) + P(D) − P(A ∩ D) = P(A) + P(B ∪ C) − P(A ∩ (B ∪ C)). (1.5)

Note that by theorem 1.2 we have:

P(B ∪ C) = P(B) + P(C) − P(B ∩ C).

We have:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

and hence

P(A ∩ (B ∪ C)) = P((A ∩ B) ∪ (A ∩ C)). (1.6)

But the right side of the last equation above is the probability of the union of two events, and hence theorem 1.2 applies:

P((A ∩ B) ∪ (A ∩ C)) = P(A ∩ B) + P(A ∩ C) − P((A ∩ B) ∩ (A ∩ C)) = P(A ∩ B) + P(A ∩ C) − P(A ∩ B ∩ C). (1.7)

Combining now equations 1.4, 1.5 and 1.6 with 1.7, we find

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).

Often it is easier to calculate the probability of a complement than the probability of the event itself. In such a situation, the following theorem is useful:

Theorem 1.4 Let A be an event and let Ac denote its complement. Then:

P(A) = 1 − P(Ac).

Proof. Note that the events A and Ac are disjoint. Furthermore, by definition A ∪ Ac = Ω. Recall that for the sample space Ω, we have that P(Ω) = 1. We can thus apply theorem 1.1 and find that:

1 = P(Ω) = P(A ∪ Ac) = P(A) + P(Ac).

This implies that:

P(A) = 1 − P(Ac),

which finishes the proof.


1.5 Some inequalities

Theorem 1.5 Let A and B be two events. Then:

P(A ∪ B) ≤ P(A) + P(B).

Proof. We know by theorem 1.2 that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Since P(A ∩ B) ≥ 0, we have that

P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B).

It follows that

P(A ∪ B) ≤ P(A) + P(B).

For several events a similar theorem holds:

Theorem 1.6 Let A1, . . . , An be a collection of n events. Then

P(A1 ∪ A2 ∪ . . . ∪ An) ≤ P(A1) + . . . + P(An).

Proof. By induction.

Another often used inequality is:

Theorem 1.7 Let A ⊂ B. Then:

P(A) ≤ P(B).

Proof. If A ⊂ B, then for every n we have that

nA ≤ nB,

hence also

nA/n ≤ nB/n.

Thus:

lim_{n→∞} nA/n ≤ lim_{n→∞} nB/n.

Hence

P(A) ≤ P(B).


2 Conditional probability and independence

Imagine the following situation: in a population there are two illnesses a and b. We assume that 20% suffer from b, 15% suffer from a, whilst 10% suffer from both. Let A be the event that a person suffers from a and let B be the event that a person suffers from b. If a patient comes to a doctor and says that he suffers from illness b, how likely is he to have illness a also? (We assume that the patient has been tested for b but not yet tested for a.) We note that half the population group suffering from b also suffers from a. Hence, when the doctor meets such a patient suffering from b, there is a chance of 1 out of 2 that the person also suffers from a. This is called the conditional probability of A given B and denoted by P(A|B). The formula we used is 10%/20% = P(A ∩ B)/P(B).

Definition 2.1 Let A, B be two events. Then we define the probability of A conditional on the event B, and write P(A|B), for the number:

P(A|B) := P(A ∩ B)/P(B).

Definition 2.2 Let A, B be two events. We say that A and B are independent of each other iff

P(A ∩ B) = P(A) · P(B).

Note that A and B are independent of each other if and only if P(A|B) = P(A). In other words, A and B are independent of each other if and only if the realization of one of the events does not affect the conditional probability of the other.
Assume that we perform two random experiments independently of each other, in the sense that the two experiments do not interact. That is, the experiments have no influence on each other. Let A denote an event related to the first experiment, and let B denote an event related to the second experiment. We saw in class that in this situation the equation P(A ∩ B) = P(A) · P(B) must hold. And thus, A and B are independent in the sense of the above definition. To show this we used an argument where we simulated the two random experiments by picking marbles from two bags.
There are also many cases where events related to the same experiment are independent in the sense of the above definition. For example, for a fair die, the events A = {1, 2} and B = {2, 4, 6} are independent.
There can also be more than two independent events at a time:

Definition 2.3 Let A1, A2, . . . , An be a finite collection of events. We say that A1, A2, . . . , An are all independent of each other iff

P(∩_{i∈I} Ai) = ∏_{i∈I} P(Ai)

for every subset I ⊂ {1, 2, . . . , n}.

The next example is very important for the test on Wednesday.


Example 2.1 Assume we flip the same coin independently three times. Let the coin be biased, so that side 1 has probability 60% and side 0 has probability 40%. What is the probability to observe the sequence 101? (By this we mean: what is the probability to first get a 1, then a 0 and eventually, at the third trial, a 1 again?)
To solve this problem, let A1, resp. A3, be the event that at the first, resp. third, trial we get a one. Let A2 be the event that at the second trial we get a zero. Observing a 101 is thus equal to the event A := A1 ∩ A2 ∩ A3. Because the trials are performed in an independent manner, it follows that the events A1, A2, A3 are independent of each other. Thus we have that:

P(A1 ∩ A2 ∩ A3) = P(A1) · P(A2) · P(A3).

We have that:

P(A1) = 60%, P(A2) = 40%, P(A3) = 60%.

It follows that:

P(A1 ∩ A2 ∩ A3) = 0.6 · 0.4 · 0.6 = 0.144.

    2.1 Law of total probability

Lemma 2.1 Let A and B be two events. Then

P(A) = P(A ∩ B) + P(A ∩ Bc). (2.1)

Furthermore, if B and Bc both have probabilities that are not equal to zero, then

P(A) = P(A|B) · P(B) + P(A|Bc) · P(Bc). (2.2)

Proof. Let D be the event D := A ∩ B. Let E be the event E := A ∩ Bc. Then we have that D and E are disjoint. Furthermore, A = E ∪ D, so that by theorem 1.1 we find:

P(A) = P(E ∪ D) = P(E) + P(D).

Replacing E and D by A ∩ Bc and A ∩ B yields equation 2.1. We can use 2.1 to find

P(A) = [P(A ∩ B)/P(B)] · P(B) + [P(A ∩ Bc)/P(Bc)] · P(Bc).

The right side of the last equality above is equal to

P(A|B) · P(B) + P(A|Bc) · P(Bc),

which finishes the proof of equation 2.2.

Let us give an example which should show that intuitively this law is very clear. Assume that in a town 90% of women are blond but only 20% of men. Assume we choose a person at random from this town; each person is equally likely to be drawn. Let W be the event that the person is a woman and B the event that the person is blond. The law of total probability can be written as

P(B) = P(B|W) · P(W) + P(B|Wc) · P(Wc). (2.3)

In our case, the conditional probability of blond conditional on woman is P(B|W) = 0.9. On the other hand, Wc is the event to draw a male, and P(B|Wc) is the conditional probability to be blond given that the person is a man. In our case, P(B|Wc) = 0.2. So, when we put the numerical values into equation 2.3, we find

P(B) = 0.9 · P(W) + 0.2 · P(Wc). (2.4)

Here P(W) is the probability that the chosen person is a woman. This is then the percentage of women in this population. Similarly, P(Wc) is the proportion of men. In other words, equation 2.4 can be read as follows: the total proportion of blonds in the population is the weighted average between the proportion of blonds among the female and the male population.

2.2 Bayes' rule

Bayes' rule is useful when one would like to calculate the conditional probability of A given B, but one is given the opposite, that is, the probability of B given A. Let us state Bayes' rule:

Lemma 2.2 Let A and B be events both having non-zero probabilities. Then

P(A|B) = P(B|A) · P(A) / P(B). (2.5)

Proof. By definition of conditional probability we have P(B|A) = P(B ∩ A)/P(A). We are now going to plug the last expression into the right side of equation 2.5. We find:

P(B|A) · P(A) / P(B) = [P(A ∩ B)/P(A)] · P(A) / P(B) = P(A ∩ B)/P(B) = P(A|B),

which establishes equation 2.5.

Let us give an example. Assume that 30% of men are interested in car races, but only 10% of women are. If I know that a person is interested in car races, what is the probability that it is a man? Again, imagine that we pick a person at random in the population. Let M be the event that the person is a man and C the event that she/he is interested in car races. We know P(C|M) = 0.3 and P(C|Mc) = 0.1. Now by Bayes' rule, the conditional probability that the person is a man given that he/she is interested in car races is:

P(M|C) = P(C|M) · P(M) / P(C). (2.6)


We have that P(C) = P(C|M) · P(M) + P(C|Mc) · P(Mc), which we can plug into 2.6 to find

P(M|C) = P(C|M) · P(M) / [P(C|M) · P(M) + P(C|Mc) · P(Mc)].

In the present numerical example, we find

P(M|C) = 0.3 · P(M) / [0.3 · P(M) + 0.1 · P(Mc)],

where P(M) represents the proportion of men in the population, whilst P(Mc) represents the proportion of women.

    3 Expectation

Imagine a firm which every year makes a profit. It is not known in advance what the profit of the firm is going to be. This means that the profit is random: we can assign to each possible outcome a certain probability. Assume that from year to year the probabilities for the profit of our firm do not change. Assume also that from one year to the next the profits are independent. What is the long-term average yearly profit equal to?
For this, let us look at a specific model. Assume the firm could make 1, 2, 3 or 4 million profit with the following probabilities:

x          1    2    3    4
P(X = x)   0.1  0.4  0.3  0.2

(The model here is not very realistic since there are only a few possible outcomes. We chose it merely to be able to illustrate our point.) Let Xi denote the profit in year i. Hence, we have that X, X1, X2, . . . are i.i.d. random variables.
To calculate the long-term average yearly profit, consider the following. In 10% of the years in the long run we get 1 million. If we take a period of n years, where n is large, we thus find that in about 0.1n years we make 1 million. In 40% of the years we make 2 millions in the long run. Hence, in a period of n years, this means that in about 0.4n years we make 2 millions. This corresponds to an amount of money equal to about 0.4n times 2 millions. Similarly, for n large, the money made during the years where we earned 3 millions is about 3 · 0.3n, whilst for the years where we made 4 millions we get 4 · 0.2n. The total during this n-year period is thus about

(1 · 0.1 + 2 · 0.4 + 3 · 0.3 + 4 · 0.2) · n = (1 · P(X = 1) + 2 · P(X = 2) + 3 · P(X = 3) + 4 · P(X = 4)) · n = 2.6 · n.

Hence, in the long run the yearly average profit is 2.6 millions. This long-term average is called expected value or expectation and is denoted by E[X]. Let us formalize this concept: in general, if X denotes the outcome of a random experiment, then we call X a random variable.


Definition 3.1 Let us consider a random experiment with a finite number of possible outcomes, where the state space is

Ω = {x1, x2, . . . , xs}.

(In the profit example above, we would have Ω = {1, 2, 3, 4}.) Let X denote the outcome of this random experiment. For x ∈ Ω, let px denote the probability that the outcome of our random experiment is x. That is:

px := P(X = x).

(In the last example above, we have for example p1 = 0.1 and p2 = 0.4.) We define the expected value E[X]:

E[X] := Σ_{x∈Ω} x · px.

In other words, to calculate the expected value of a random variable, we simply multiply the probabilities with the corresponding values and then take the sum over all possible outcomes. Let us see yet another example for expectation.

Example 3.1 Let X denote the value which we obtain when we throw a fair coin with side 0 and side 1. Then we find that:

E[X] = 0.5 · 1 + 0.5 · 0 = 0.5.

When we keep repeating the same random experiment independently and under the same conditions, in the long run we will see that the average value which we observe converges to the expectation. This is what we saw in the firm/profit example above. Let us formalize this. This fact is actually a theorem which is called the Law of Large Numbers. This theorem goes as follows:

Theorem 3.1 Assume we repeat the same random experiment under the same conditions independently many times. Let Xi denote the random variable which is the outcome of the i-th experiment. Then:

lim_{n→∞} (X1 + X2 + . . . + Xn)/n = E[X1]. (3.1)

This simply means that in the long run, the average is going to be equal to the expectation.

Proof. Let Ω denote the state space of the random variables Xi:

Ω = {x1, x2, . . . , xs}.

By regrouping the same terms together, we find:

X1 + X2 + . . . + Xn = x1 · nx1 + x2 · nx2 + . . . + xs · nxs.


(Remember that nxi denotes the number of times we observe the value xi in the finite sequence X1, X2, . . . , Xn.) Thus:

lim_{n→∞} (X1 + X2 + . . . + Xn)/n = lim_{n→∞} (x1 · nx1/n + . . . + xs · nxs/n).

By definition,

P(X1 = xi) = lim_{n→∞} nxi/n.

Since the limit of a sum is the sum of the limits, we find

lim_{n→∞} (x1 · nx1/n + . . . + xs · nxs/n) = x1 · lim_{n→∞} nx1/n + . . . + xs · lim_{n→∞} nxs/n
= x1 · P(X = x1) + . . . + xs · P(X = xs) = E[X1].

So, we can now generalize our firm profit example. Imagine for this that the profit a firm makes every month is random. Imagine also that the earnings from month to month are independent of each other and also have the same probabilities. In this case we can view the sequence of earnings, month by month, as a sequence of repeats of the same random experiment. Because of theorem 3.1, in the long run the average monthly income will be equal to the expectation.
Let us next give a few useful lemmas in connection with expectation. The first lemma deals with the situation where we take an i.i.d. sequence of random outcomes X1, X2, X3, . . . and multiply each one of them by a constant a. Let Yi denote the number Xi multiplied by a; hence Yi := a · Xi. Then the long-term average of the Xis multiplied by a equals the long-term average of the Yis. Let us state this fact in a formal way:

Lemma 3.1 Let X denote the outcome of a random experiment. (Thus X is a so-called random variable.) Let a be a real (non-random) number. Then:

E[aX] = a · E[X].

Proof. Let us repeat the same experiment independently many times. Let Xi denote the outcome of the i-th trial. Let Yi be equal to Yi := a · Xi. Then by the law of large numbers, we have that

lim_{n→∞} (Y1 + . . . + Yn)/n = E[Y1] = E[aX1].

However:

lim_{n→∞} (Y1 + . . . + Yn)/n = lim_{n→∞} (aX1 + . . . + aXn)/n = lim_{n→∞} a · (X1 + . . . + Xn)/n
= a · lim_{n→∞} (X1 + . . . + Xn)/n = a · E[X1].

  • This proves that E[aX1] = aE[X1] and finishes this proof. The next lemma isextremely important when dealing with the expectation of sums of random variables. Itstates that the sum of the expectation is equal to the expectation of the sum. We canthink of a simple real life example which shows why this should be true. Imagine thatMatzinger is the owner of two firm (wishful thinking since Matzinger is a poor professor).Let Xi denote the profit made by his first firm in year i. Let Yi denote the profit madeby his second firm in year i. We assume that from year to year the probabilities do notchange for both firms and the profits are independent (from year to year). In other wordsX,X1, X2, X3, . . . are i.i.d. variables and so are Y, Y1, Y2, . . .. Let Zi denote the total profitMatzinger makes in year i, so that Zi = Xi + Yi. Now obviously the long term averageyearly profit of Matzinger is the long term average yearly profit form the first firm plusthe long term average profit from the second firm. In mathematical writing this gives:

    E[X + Y ] = E[X] + E[Y ].

As a matter of fact, E[X + Y] denotes the long term average profit of Matzinger. On the other hand, E[X] denotes the average profit of the first firm, whilst E[Y] denotes the average profit of the second firm. Let us next formalize all of this:

Lemma 3.2 Let X, Y denote the outcomes of two random experiments. Then:

    E[X + Y ] = E[X] + E[Y ].

Proof. Let us repeat the two random experiments independently many times. Let Xi denote the outcome of the i-th trial of the first random experiment. Let Yi be equal to the outcome of the i-th trial of the second random experiment. For all i ∈ N, let Zi := Xi + Yi. Then by the law of large numbers, we have that:

lim_{n→∞} (Z1 + . . . + Zn)/n = E[Z1] = E[X1 + Y1].

However:

lim_{n→∞} (Z1 + . . . + Zn)/n = lim_{n→∞} (X1 + Y1 + X2 + Y2 + . . . + Xn + Yn)/n
= lim_{n→∞} ((X1 + . . . + Xn) + (Y1 + . . . + Yn))/n
= lim_{n→∞} (X1 + . . . + Xn)/n + lim_{n→∞} (Y1 + . . . + Yn)/n = E[X1] + E[Y1].

This proves that E[X1 + Y1] = E[X1] + E[Y1] and finishes the proof. It is very important to note that the above lemma does not require X and Y to be independent of each other.
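As a quick sanity check of the lemma, here is a small simulation sketch (illustrative Python; the dice model and all variable names are our own, not from the text). It deliberately uses a Y that is completely dependent on X, since the lemma does not require independence:

```python
import random

random.seed(0)

n = 100_000
# X is a fair die; Y = 7 - X is completely dependent on X.
# E[X] = E[Y] = 3.5, so E[X + Y] should be 7 -- no independence needed.
xs = [random.randint(1, 6) for _ in range(n)]
ys = [7 - x for x in xs]

avg_x = sum(xs) / n
avg_z = sum(x + y for x, y in zip(xs, ys)) / n

print(avg_x)  # close to 3.5
print(avg_z)  # exactly 7.0, since X + Y = 7 in every trial
```

The empirical average of Zi = Xi + Yi lands on E[X] + E[Y] = 7 even though X and Y are as dependent as possible.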


Lemma 3.3 Let X, Y denote the outcomes of two independent random experiments. Then:

E[X · Y] = E[X] · E[Y].

Proof. We assume that X takes values in a countable set 𝒳, whilst Y takes on values from the countable set 𝒴. We have that

E[XY] = Σ_{x∈𝒳, y∈𝒴} xy P(X = x, Y = y).   (3.2)

By independence of X and Y, we have that P(X = x, Y = y) = P(X = x)P(Y = y). Plugging the last equality into 3.2, we find

E[XY] = Σ_{x∈𝒳, y∈𝒴} xy P(X = x)P(Y = y) = (Σ_{x∈𝒳} x P(X = x)) · (Σ_{y∈𝒴} y P(Y = y)) = E[X]E[Y].

    So we have proven that E[XY ] = E[X] E[Y ].
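Lemma 3.3 can likewise be checked numerically. The sketch below (our own illustrative Python, with two independent fair dice) compares the empirical mean of XY with the product of the empirical means:

```python
import random

random.seed(1)

n = 200_000
# Two independent fair dice per trial.
xs = [random.randint(1, 6) for _ in range(n)]
ys = [random.randint(1, 6) for _ in range(n)]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
mean_xy = sum(x * y for x, y in zip(xs, ys)) / n

print(mean_xy)           # close to 3.5 * 3.5 = 12.25
print(mean_x * mean_y)   # close to the same value
```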

4 Dispersion, average fluctuation and standard deviation

In some problems we are only interested in the expectation of a random variable. For example, consider insurance policies for mobile telephones sold by a big phone company. Say Xi is the amount which will be paid during the coming year to the i-th customer due to his/her phone breaking down. It seems reasonable to assume that the Xi's are independent of each other. (We assume no phone viruses.) We also assume that they all follow the same random model. So, by the Law of Large Numbers we have that for n large, the average is approximately equal to the expectation:

(X1 + X2 + . . . + Xn)/n ≈ E[Xi].

Hence, when n is really large, there is no risk involved for the phone company: they know how much they will have to pay in total: on a per customer basis, they will have to spend an amount very close to E[X1]. In other words, they only need one real number from the probability model for the claims: the expectation E[Xi]. Now, in many other applications knowing only the expected value will not be enough: we will also need a measure of the dispersion. This means that we will also want to know how much on average the variables fluctuate from their long term average E[X1].
Let us give an example. Matzinger as a child used to walk with his mother every day on the shores of Lake Geneva. Now, there is a place where there is a scale to measure the height of the water. So, hydrologists measure the water level and then analyze this data. Assume that Xi denotes the water level on a specific day in year i. (We assume that we always measure on the same day of the year, for example on the first of January.) For the current discussion we assume that the model does not change


over time (no global warming). We furthermore assume that from one year to the next the values are independent. Say the random model would be given as follows:

x          4    5    6    7    8    9
P(X = x)  1/6  1/6  1/6  1/6  1/6  1/6

How much does the water level fluctuate on average from year to year? Note that the long term average, that is, the expectation, is equal to

E[Xi] = 4 · (1/6) + 5 · (1/6) + 6 · (1/6) + 7 · (1/6) + 8 · (1/6) + 9 · (1/6) = 6.5.

Now, when the water level is 6 or 7, then we are 0.5 away from the long term average of 6.5. In such a year i, we will say that the fluctuation fi is 0.5. In other words, we measure for each year i how far we are from E[Xi]. This observed fluctuation in year i is then equal to

fi := |Xi − 6.5| = |Xi − E[Xi]|.

In our model, fi = 0.5 happens with a probability of 1/3, that is, in the long run, in one third of the years. When the water level is either at 8 or 5, then we are 1.5 away from the long term average of 6.5. This also has a probability of 1/3. Finally, with water levels of 4 or 9, we are 2.5 away from the long term average, and again this will happen in a third of the years in the long run. So, if this model holds, the long term average fluctuation will always tend to be about

E[fi] = E[|Xi − E[Xi]|] = 2.5 · (1/3) + 1.5 · (1/3) + 0.5 · (1/3) = 1.5

after many years. To understand why, simply consider the fluctuations f1, f2, f3, . . .. By the Law of Large Numbers applied to them we get that for n large, the average fluctuation is approximately equal to its expectation:

(f1 + f2 + . . . + fn)/n ≈ E[fi] = E[|Xi − E[Xi]|].   (4.1)

So, no matter what, after many years we will always know what the average fluctuation is approximately equal to: the expression on the right side of 4.1.

The real number

Long term average fluctuation = E[|Xi − E[Xi]|]   (4.2)

is a measure of the dispersion (around the expectation) in our model. It should be obvious why this dispersion is important: if it is small, the people of Geneva will be safe. If it is big, they will often have to deal with flooding. So, in some sense, we can view the value given in 4.2 as a measure of risk: if the dispersion is 0, then there is no risk and the random number is not random but always equal to the fixed value E[X1]!

In modern statistics, however, one most often considers a number which represents the same idea, but can be slightly different from 4.2. The number we will use most often is not the average fluctuation, but instead the square root of the average squared fluctuation. This number is called the standard deviation of a random variable. We usually denote it by σ, so

σX := √(E[(X − E[X])²]).


The long term average squared fluctuation of a random variable X is also called the variance, and will be denoted by VAR[X], so that

VAR[X] := E[(X − E[X])²].

With this definition the standard deviation is simply the square root of the variance:

σX = √(VAR[X]).

In most cases, σX and our other measure of dispersion given by E[|X − E[X]|] are almost equal.

Let us go back to our example. The variance is the average squared fluctuation. We thus get:

VAR[Xi] = E[fi²] = 2.5² · (1/3) + 1.5² · (1/3) + 0.5² · (1/3) ≈ 2.92

and hence the standard deviation is

σXi = √(VAR[Xi]) = √2.92 ≈ 1.7.

So, we see the average fluctuation size was E[|Xi − E[Xi]|] = 1.5, whilst the standard deviation is (only) about 13% bigger.
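The water-level computations above can be reproduced in a few lines of code. This is an illustrative Python sketch (exact arithmetic with fractions; all names are ours, not from the text):

```python
from fractions import Fraction

# The water-level model from the text: levels 4..9, each with probability 1/6.
levels = range(4, 10)
p = Fraction(1, 6)

e = sum(x * p for x in levels)               # expectation E[X]
mad = sum(abs(x - e) * p for x in levels)    # average fluctuation E|X - E[X]|
var = sum((x - e) ** 2 * p for x in levels)  # variance E[(X - E[X])^2]
sd = float(var) ** 0.5                       # standard deviation

print(e, mad, var, round(sd, 3))  # 13/2, 3/2, 35/12, 1.708
```

The exact variance is 35/12 ≈ 2.92, in agreement with the computation in the text.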

Now, standard deviation is most often used for determining the order of magnitude of a random imprecision. So, we don't care about knowing that number absolutely exactly: instead we just want the order of magnitude. In other words, in most applications E[|Xi − E[Xi]|] and σXi are sufficiently close to each other that for applications it does not matter which one of the two we take! But, it will turn out that the standard deviation allows for certain calculations which the other measure of dispersion in 4.2 does not allow for. So, we will work more often with the standard deviation.

4.1 Matzinger's rule of thumb

A rule of thumb is that: most variables most of the time take values not further than two standard deviations from their expected values. We could thus write in a loose way:

X ≈ E[X] ± 2σX.

To understand where this rule comes from, simply think of the following: for example, the average American household income is around $70,000. How many households make more than twice that much, that is, above $140,000? Certainly not a very large portion of the population. Now, in our case the argument is not about the average itself, but about the average fluctuation. Still, it is an average. So, what is true for averages should also be true for an average of fluctuations... We will see below Chebyshev's inequality, which covers the worst possible scenario. The probability

    20

for any random variable to be further than 2 standard deviations from its expected value could be as much as 25% but never more:

P(|Z − E[Z]| ≥ 2σZ) ≤ 0.25.

The above inequality holds for any random variable, so it represents in some sense the worst case. It will be proven in our section on Chebyshev's inequality. For normal variables, the probability to be further than two standard deviations is much smaller: it is about 0.05. Now, we will see in the section on the central limit theorem that any sum of many independent random contributions is approximately normal as soon as they follow about the same model. Note that 0.05 is much smaller than 0.25. In real life, in many cases, one will be in between these two possibilities. This rule of thumb is extremely useful when analyzing data and trying to get the big picture!
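One can test the rule of thumb empirically. The following sketch (our own Python illustration; the choice of variable, a count of heads in 100 fair coin tosses, is ours) estimates how often such a variable stays within two standard deviations of its mean:

```python
import random

random.seed(2)

# Z = number of heads in 100 fair coin tosses: E[Z] = 50, sigma = 5.
mu, sigma = 50.0, 5.0
trials = 20_000

inside = 0
for _ in range(trials):
    z = sum(random.random() < 0.5 for _ in range(100))
    if abs(z - mu) <= 2 * sigma:
        inside += 1

frac = inside / trials
print(frac)  # roughly 0.96 -- far better than the worst-case guarantee of 0.75
```

This variable is a sum of many independent contributions, so it is close to normal, and indeed the observed fraction is near the normal value of about 0.95 rather than the worst-case 0.75.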

    5 Calculation with the Variance

Let X be the outcome of a random experiment. We define the variance of X to be equal to:

VAR[X] := E[(X − E[X])²].

    The square root of the variance is called standard deviation:

σX := √(VAR[X]).

The standard deviation is a measure for the typical order of magnitude of how far away the value we get after doing the experiment once is from E[X].

Lemma 5.1 Let a be a non-random number and X the outcome of a random experiment. Then:

    V AR[aX] = a2V AR[X].

Proof. We have:

VAR[aX] = E[(aX − E[aX])²] = E[(aX − aE[X])²] = E[a²(X − E[X])²] = a²E[(X − E[X])²] = a²VAR[X],

which finishes the proof that VAR[aX] = a²VAR[X].

Lemma 5.2 Let X be the outcome of a random experiment (in other words, a random variable). Then:

VAR[X] = E[X²] − (E[X])².

Proof. We have that

E[(X − E[X])²] = E[X² − 2XE[X] + E[X]²] = E[X²] − 2E[XE[X]] + E[E[X]²].   (5.1)


Now E[X] is a constant, and constants can be taken out of the expectation. This implies that

E[XE[X]] = E[X]E[X] = E[X]².   (5.2)

On the other hand, the expectation of a constant is the constant itself. Thus, since E[X]² is a constant, we find:

E[E[X]²] = E[X]².   (5.3)

Using equations 5.2 and 5.3 with 5.1 we find

E[(X − E[X])²] = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².

This finishes the proof that VAR[X] = E[X²] − E[X]².
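Lemma 5.2 is easy to verify exactly on a small example. The sketch below (illustrative Python; the fair die is our choice, not an example from the text) computes the variance both ways:

```python
from fractions import Fraction

# A fair six-sided die, in exact arithmetic.
p = Fraction(1, 6)
values = range(1, 7)

ex = sum(x * p for x in values)                   # E[X] = 7/2
ex2 = sum(x * x * p for x in values)              # E[X^2] = 91/6
var_short = ex2 - ex ** 2                         # the shortcut formula
var_def = sum((x - ex) ** 2 * p for x in values)  # the definition

print(var_short, var_def)  # both equal 35/12
```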

Lemma 5.3 Let X and Y be the outcomes of two random experiments which are independent of each other. Then:

    V AR[X + Y ] = V AR[X] + V AR[Y ].

Proof. We have:

VAR[X + Y] = E[((X + Y) − E[X + Y])²] = E[(X + Y − E[X] − E[Y])²]
= E[((X − E[X]) + (Y − E[Y]))²]
= E[(X − E[X])² + 2(X − E[X])(Y − E[Y]) + (Y − E[Y])²]
= E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y])] + E[(Y − E[Y])²].

Since X and Y are independent, (X − E[X]) is also independent from (Y − E[Y]). Thus, we can use Lemma 3.3, which says that the expectation of a product equals the product of the expectations when the variables are independent. We find:

E[(X − E[X])(Y − E[Y])] = E[X − E[X]] · E[Y − E[Y]].

Furthermore:

E[X − E[X]] = E[X] − E[E[X]] = E[X] − E[X] = 0.

Thus

E[(X − E[X])(Y − E[Y])] = 0.

Applying this to the above formula for VAR[X + Y], we get:

VAR[X + Y] = E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y])] + E[(Y − E[Y])²]
= E[(X − E[X])²] + E[(Y − E[Y])²] = VAR[X] + VAR[Y].

    This finishes our proof.
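Lemma 5.3 can be checked by simulation. The following illustrative Python sketch sums two independent fair dice and compares the empirical variance of the sum with VAR[X] + VAR[Y] = 35/12 + 35/12 = 70/12 ≈ 5.83:

```python
import random

random.seed(3)

n = 200_000
# Sum of two independent fair dice.
zs = [random.randint(1, 6) + random.randint(1, 6) for _ in range(n)]

mean = sum(zs) / n
var = sum((z - mean) ** 2 for z in zs) / n

print(round(mean, 2), round(var, 2))  # close to 7 and 5.83
```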


5.1 Getting the big picture with the help of Matzinger's rule of thumb

We mentioned that most of the time, any random variable takes values no further than two times its standard deviation from its expectation. We can apply this and our calculation for the variance to understand how insurance, hedging investments, and even statistical estimation work. Let X1, X2, . . . be a sequence of random variables which all follow the same model and are independent of each other. Let Z be the sum of n such variables:

    Z = X1 +X2 + . . .+Xn

    We find that

    E[Z] = E[X1 +X2 + . . .+Xn] = E[X1] + E[X2] + . . .+ E[Xn] = nE[X1]

    Similarly we can use the fact that the variance of a sum of independent variables is thesum of the variance to find:

VAR[Z] = VAR[X1 + X2 + . . . + Xn] = VAR[X1] + VAR[X2] + . . . + VAR[Xn] = nVAR[X1].

    Using the last equation above with the fact that standard deviation is the square root ofvariance we find:

σZ = √(VAR[Z]) = √(nVAR[X1]) = √n · σX1.

In other words: the expectation of a sum of n independent variables grows like n times a constant, but the standard deviation grows only like the square root of n times a constant! This is everything you need to know for understanding how insurance and other risk reducers work... Let us see different examples of what these random numbers Xi could represent:

Say you are an insurance company specializing in providing life insurance. Let Xi be the claim in the current year of the i-th client. You have n clients, so the total claim which you as a company will have to pay is Z = X1 + X2 + . . . + Xn.

You buy houses which you flip and then try to sell at a profit. You have bought houses all over the US. Assuming the economy and real estate market stay very stable, we can assume that the selling prices will be independent of each other. So, let Xi represent the profit (or loss) for the i-th house which you are currently renovating. This profit or loss is random, since you don't know exactly what it will be until you sell. Assume that you currently have n houses which you are renovating. Then Z = X1 + . . . + Xn is your total profit or loss with the n houses you are currently holding. This is a random variable since its outcome is not known in advance.


So, again it is all based on the following two equations, which hold when the Xi's are independent and follow the same model:

σ_{X1+...+Xn} = σX1 · √n

E[X1 + . . . + Xn] = nE[X1]

So for example with n = 1000000, we get

σZ = 1000 σX1

whilst

E[Z] = 1000000 E[X1],

so σZ becomes negligible compared to E[Z]. So, if we think that most of the time a variable is within two standard deviations of its expectation, we find

Z ≈ 1000000 E[X1] ± 2 · 1000 σX1,

so, compared to the order of magnitude of Z, the fluctuation becomes almost negligible!
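The √n-scaling of the standard deviation can be observed directly. In the illustrative Python sketch below (dice sums; all parameter choices are ours), quadrupling n roughly quadruples the mean of the sum but only doubles its standard deviation:

```python
import math
import random

random.seed(4)

def mean_and_sd_of_sum(n, trials=5_000):
    """Empirical mean and standard deviation of a sum of n fair dice."""
    sums = [sum(random.randint(1, 6) for _ in range(n)) for _ in range(trials)]
    m = sum(sums) / trials
    sd = math.sqrt(sum((s - m) ** 2 for s in sums) / trials)
    return m, sd

results = {n: mean_and_sd_of_sum(n) for n in (100, 400)}
for n, (m, sd) in results.items():
    print(n, round(m), round(sd, 1))
# Quadrupling n multiplies the mean by ~4 but the standard deviation only by ~2.
```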

    6 Covariance and correlation

Two random variables are dependent when their joint distribution is not simply the product of their marginal distributions. But the degree of dependence can vary from strong dependence to loose dependence. One measure of the degree of dependence of random variables is the covariance. For random variables X and Y we define the covariance as follows:

COV[X, Y] = E[(X − E[X])(Y − E[Y])]

Lemma: For random variables X and Y there is also another equivalent formula for the covariance:

COV[X, Y] = E[XY] − E[X]E[Y]

Proof:

E[(X − E[X])(Y − E[Y])] = E[XY − Y E[X] − XE[Y] + E[X]E[Y]]
= E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y]
= E[XY] − E[X]E[Y]

Lemma: For independent random variables X and Y,

COV[X, Y] = 0


Proof:

COV[X, Y] = E[XY] − E[X]E[Y].

For independent X and Y, E[XY] = E[X]E[Y]. Hence COV[X, Y] = 0.

Lemma:

COV[X, X] = VAR[X]

Proof:

COV[X, X] = E[X · X] − E[X]E[X] = E[X²] − E[X]² = VAR[X]

Lemma:

    Assume that a is a constant and let X and Y be two random variables. Then

COV[X + a, Y] = COV[X, Y]

    Proof:

COV[X + a, Y] = E[(X + a − E[X + a])(Y − E[Y])]
= E[YX + Ya − Y E[X + a] − XE[Y] − aE[Y] + E[Y]E[X + a]]
= E[XY] + aE[Y] − E[Y]E[X + a] − E[X]E[Y] − aE[Y] + E[Y]E[X + a]
= E[XY] − E[X]E[Y]
= COV[X, Y]

    Lemma:

    Let a be a constant and let X and Y be random variables. Then

COV[aX, Y] = aCOV[X, Y]

    Proof:

COV[aX, Y] = E[(aX − E[aX])(Y − E[Y])]
= E[aXY − Y E[aX] − aXE[Y] + E[aX]E[Y]]
= aE[XY] − aE[X]E[Y] − aE[X]E[Y] + aE[X]E[Y]
= aE[XY] − aE[X]E[Y]
= a(E[XY] − E[X]E[Y])
= aCOV[X, Y]


  • Lemma:

    For any random variables X, Y and Z we have:

COV[Z + X, Y] = COV[Z, Y] + COV[X, Y]

    Proof:

COV[Z + X, Y] = E[(X + Z − E[X + Z])(Y − E[Y])]
= E[YX + YZ − Y E[X + Z] − XE[Y] − ZE[Y] + E[X + Z]E[Y]]
= E[YX] + E[YZ] − E[Y]E[X + Z] − E[X]E[Y] − E[Z]E[Y] + E[X + Z]E[Y].

Since E[A + B] = E[A] + E[B], we get

= E[YX] + E[YZ] − E[Y]E[X] − E[Y]E[Z] − E[X]E[Y] − E[Z]E[Y] + E[Y]E[X] + E[Z]E[Y]
= E[YX] + E[YZ] − E[X]E[Y] − E[Z]E[Y]
= E[YX] − E[X]E[Y] + E[YZ] − E[Z]E[Y]
= COV[X, Y] + COV[Z, Y]

Note that COV[X, Y] = COV[Y, X].

    6.1 Correlation

    We define the correlation as follows:

COR[X, Y] = COV[X, Y] / √(VAR[X] · VAR[Y])

One can prove that the correlation is always between −1 and 1. When the variables are independent, the correlation is zero. The correlation is one when Y can be written as Y = a + bX, where a, b are constants such that b > 0. If the correlation is −1, then the variable Y can be written as Y = a + bX, where b is a negative constant and a is any constant. An important property of the correlation is that when we multiply a variable by a positive constant the correlation does not change: COR[aX, Y] = COR[X, Y] for a > 0. This implies that a change of units does not affect correlation.
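The definitions of covariance and correlation translate directly into code. The sketch below (our own Python illustration with Gaussian samples; all names and parameter choices are ours) shows the three regimes just mentioned: correlation exactly 1 for Y = a + bX with b > 0, an intermediate value for partially dependent variables, and roughly 0 for independent ones:

```python
import math
import random

random.seed(5)

def mean(v):
    return sum(v) / len(v)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def cor(xs, ys):
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

n = 50_000
xs = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]

cor_affine = cor(xs, [2 * x + 1 for x in xs])              # Y = a + bX, b > 0
cor_partial = cor(xs, [x + e for x, e in zip(xs, noise)])  # partially dependent
cor_indep = cor(xs, noise)                                 # independent of xs

print(round(cor_affine, 4), round(cor_partial, 3), round(cor_indep, 3))
# 1.0, about 0.707, about 0.0
```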

    7 Chebyshevs and Markovs inequalities

Let us first explain the Markov inequality with an example. For this, assume the dividend paid (per share) next year is a random variable X. Let the expected amount of money paid be equal to E[X] = 2 dollars. Then the probability that the dividend pays more than 100 dollars cannot be more than 2/100 = E[X]/100, since otherwise the expectation would have to be bigger than 2. In other words, for a random variable X which cannot take negative values, the probability that the random variable is bigger than a is at most E[X]/a. This is the content of the next lemma:


Lemma 7.1 Assume that a > 0 is a constant and let X be a random variable taking on only non-negative values, i.e. P(X ≥ 0) = 1. Then,

P(X ≥ a) ≤ E[X]/a.

Proof. To simplify the notation, we assume that the variable takes on only integer values. The result remains valid otherwise. We have that

E[X] = 0 · P(X = 0) + 1 · P(X = 1) + 2 · P(X = 2) + 3 · P(X = 3) + . . .   (7.1)

Note that the sum on the right side of the above equation contains only non-negative terms. If we leave out some of these terms, the value can only decrease or stay equal. We are going to keep only the terms x · P(X = x) for x greater than or equal to a. This way equation 7.1 becomes

E[X] ≥ xa · P(X = xa) + (xa + 1) · P(X = xa + 1) + (xa + 2) · P(X = xa + 2) + . . .   (7.2)

where xa denotes the smallest natural number which is larger than or equal to a. Note that xa + i ≥ a for any natural number i. With this we obtain that the right side of 7.2 is larger than or equal to

a · (P(X = xa) + P(X = xa + 1) + P(X = xa + 2) + . . .) = aP(X ≥ a),

and hence

E[X] ≥ aP(X ≥ a).

The last inequality above implies:

P(X ≥ a) ≤ E[X]/a.

The inequality given in the last lemma is called the Markov inequality. It is very useful: in many real world situations it is difficult to estimate all the probabilities (the probability distribution) for a random variable. However, it might be easier to estimate the expectation, since that is just one number. If we know the expectation of a random variable, we can at least get upper bounds on the probability of being far away from the expectation.
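The Markov inequality can be watched at work numerically. This illustrative Python sketch (the choice of an exponential variable with mean 1 is ours, not the text's) compares observed tail frequencies with the bound E[X]/a:

```python
import random

random.seed(6)

n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]  # non-negative, E[X] = 1

ex = sum(xs) / n
checks = {a: (sum(x >= a for x in xs) / n, ex / a) for a in (2, 5, 10)}

for a, (freq, bound) in checks.items():
    print(a, round(freq, 4), round(bound, 4))
# The observed tail frequency always stays below the Markov bound E[X]/a.
```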

Let us next present the Chebyshev inequality:

Lemma 7.2 If X is a random variable with expectation E[X] and variance VAR[X], and a ≥ 0 is a non-random number, then

P(|X − E[X]| ≥ a) ≤ VAR[X]/a².


Proof. Note that |X − E[X]| ≥ a implies (X − E[X])² ≥ a², and vice versa. Hence,

P(|X − E[X]| ≥ a) = P((X − E[X])² ≥ a²).   (7.3)

If Y = (X − E[X])², then Y is a non-negative variable, and

P(|X − E[X]| ≥ a) = P(Y ≥ a²).   (7.4)

Since Y is non-negative, the Markov inequality applies. Hence,

P(Y ≥ a²) ≤ E[Y]/a² = E[(X − E[X])²]/a² = VAR[X]/a².

Using the last chain of inequalities above with equalities 7.3 and 7.4, we find

P(|X − E[X]| ≥ a) ≤ VAR[X]/a².

Let us consider one more example. Assume the total expected claim at the end of next year for an insurance company is $1,000,000. What is the risk that the insurance company has to pay more than $5,000,000 as total claim at the end of next year? The answer goes as follows: let Z be the total claim at the end of next year. By the Markov inequality, we find

P(Z ≥ 5000000) ≤ E[Z]/5000000 = 1/5 = 20%.

Hence, we know that the probability of having to pay more than five million is at most 20%. To derive this, the only information needed was the expectation of Z. When the standard deviation is also available, one can usually get better bounds using the Chebyshev inequality. Assume in the example above that the expected total claim is as before, but let the standard deviation of the total claim be one million. Then we have VAR[Z] = (1000000)².

Note that for Z to be above 5000000 we need Z − E[Z] to be above 4000000. Hence,

P(Z ≥ 5000000) = P(Z − E[Z] ≥ 4000000) ≤ P(|Z − E[Z]| ≥ 4000000).

Using Chebyshev, we get

P(|Z − 1000000| ≥ 4000000) ≤ VAR[Z]/(4000000)² = 1/16 = 0.0625.

It follows that the probability that the total claim is above five million is less than 6.25 percent. This is a lot less than the bound we had found using Markov's inequality.


  • 8 Combinatorics

Theorem 8.1 Let

Ω = {x1, x2, . . . , xs}

denote the state space of a random experiment. Let each possible outcome have the same probability. Let E be an event. Then,

P(E) = (number of outcomes in E) / (total number of outcomes) = |E|/s.

Proof. We know that

P(Ω) = 1.

Now

P(Ω) = P(X ∈ {x1, . . . , xs}) = P({X = x1} ∪ . . . ∪ {X = xs}) = Σ_{t=1,...,s} P(X = xt).

Since all the outcomes have equal probability, we have that

Σ_{t=1,...,s} P(X = xt) = s · P(X = x1).

Thus,

P(X = x1) = 1/s.

Now if

E = {y1, . . . , yj},

we find that

P(E) = P(X ∈ E) = P({X = y1} ∪ . . . ∪ {X = yj}) = Σ_{i=1}^{j} P(X = yi) = j/s,

which finishes the proof.

Next we present one of the main principles used in combinatorics:

Lemma 8.1 Let m1, m2, . . . , mr denote a given finite sequence of natural numbers. Assume that we have to make a sequence of r choices. At the s-th choice, assume that we have ms possibilities to choose from. Then the total number of possibilities is:

m1 · m2 · . . . · mr.

Why this lemma holds can best be understood by thinking of a tree, where at each node which is s − 1 steps away from the root we have ms new branches.


Example 8.1 Assume we first throw a coin with a side 0 and a side 1. Then we throw a four-sided die. Eventually we throw the coin again. For example, we could get the number 031. How many different numbers are there which we could get? The answer is: first we have two possibilities. For the second choice we have four, and eventually we have again two. Thus, m1 = 2, m2 = 4, m3 = 2. This implies that the total number of possibilities is:

m1 · m2 · m3 = 2 · 4 · 2 = 16.

Recall that the product of all natural numbers which are less than or equal to k is denoted by k!. k! is called k-factorial.

Lemma 8.2 There are

k!

possibilities to put k different objects in a linear order. Thus there are k! permutations of k elements.

To realize why the last lemma above holds we use Lemma 8.1. To place k different objects in a row, we first choose the first object which we will place down. For this we have k possibilities. For the second object, there remain k − 1 objects to choose from. For the third, there are k − 2 possibilities to choose from. And so on and so forth. This then gives that the total number of possibilities is equal to k · (k − 1) · . . . · 2 · 1.

Lemma 8.3 There are

n!/(n − k)!

possibilities to pick k out of n different objects, when the order in which we pick them matters.

For the first object, we have n possibilities. For the second object we pick, we have n − 1 remaining objects to choose from. For the last object which we pick (that is, the k-th which we pick), we have n − k + 1 remaining objects to choose from. Thus the total number of possibilities is equal to

n · (n − 1) · . . . · (n − k + 1),

which is equal to

n!/(n − k)!.

The number n!/(n − k)! is also equal to the number of words of length k written with an n-letter alphabet, when we require that the words never contain the same letter twice.

Lemma 8.4 There are

n!/(k!(n − k)!)

subsets of size k in a set of size n.


The reason why the last lemma holds is the following: there are k! ways of putting a given subset of size k into different orders. Thus, there are k! times more ways to pick k elements in order than there are subsets of size k.
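Lemmas 8.2 to 8.4 can be confirmed by brute-force enumeration. An illustrative Python sketch (the five-element set is our own example):

```python
import math
from itertools import combinations, permutations

items = "abcde"  # a set of n = 5 different objects
n, k = 5, 3

orderings = len(list(permutations(items)))        # Lemma 8.2: 5! orders of 5 objects
arrangements = len(list(permutations(items, k)))  # Lemma 8.3: n!/(n-k)!
subsets = len(list(combinations(items, k)))       # Lemma 8.4: n choose k

print(orderings, arrangements, subsets)  # 120 60 10
```

Note that 60 = 10 · 3!, illustrating the "k! times more ordered picks than subsets" argument above.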

Lemma 8.5 There are

2^n

subsets of any size in a set of size n.

The reason why the last lemma above holds is the following: we can identify the subsets with binary vectors with n entries. For example, let n = 5. Let the set we consider be {1, 2, 3, 4, 5}. Take the binary vector

(1, 1, 1, 0, 0).

This vector corresponds to the subset containing the first three elements of the set, thus to the subset

{1, 2, 3}.

So, for every non-zero entry in the vector we pick the corresponding element in the set. It is clear that this correspondence between subsets of a set of size n and binary vectors of dimension n is one to one. Thus, there are as many subsets as there are binary vectors of length n. The total number of binary vectors of dimension n, however, is 2^n.

    9 Important discrete random variables

    9.1 Bernoulli variable

Let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. Let X designate the random number we obtain when we flip this coin. Thus, with probability p the random variable X takes on the value 1 and with probability 1 − p it takes on the value 0. The random variable X is called a Bernoulli variable with parameter p. It is named after the famous Swiss mathematician Bernoulli. For a Bernoulli variable X with parameter p we have:

E[X] = p.   VAR[X] = p(1 − p).

Let us show this:

E[X] = 1 · p + 0 · (1 − p) = p.

For the variance we find:

VAR[X] = E[X²] − (E[X])² = 1² · p + 0² · (1 − p) − (E[X])² = p − p² = p(1 − p).


9.2 Binomial random variable

Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently n times and count the number of 1's observed. The number Z of 1's observed after n coin tosses is equal to

Z := X1 + X2 + . . . + Xn

where Xi designates the result of the i-th toss. (Hence the Xi's are independent Bernoulli variables with parameter p.) The random variable Z is called a binomial variable with parameters p and n. For the binomial random variable with parameters p and n we find:

E[Z] = np
VAR[Z] = np(1 − p)
For k ≤ n, we have: P(Z = k) = (n choose k) p^k (1 − p)^{n−k}.

Let us show the above statements:

E[Z] = E[X1 + . . . + Xn] = E[X1] + . . . + E[Xn] = n · E[X1] = n · p.

Also:

VAR[Z] = VAR[X1 + . . . + Xn] = VAR[X1] + . . . + VAR[Xn] = nVAR[X1] = np(1 − p).

Let us calculate next the probability P(Z = k). We start with an example. Take n = 3 and k = 2. We want to calculate the probability of observing exactly two ones among the first three coin tosses. To observe exactly two ones out of three successive trials there are exactly three possibilities:

Let A be the event: X1 = 1, X2 = 1, X3 = 0
Let B be the event: X1 = 1, X2 = 0, X3 = 1
Let C be the event: X1 = 0, X2 = 1, X3 = 1.

Each of these possibilities has probability p²(1 − p). As a matter of fact, since the trials are independent, we have for example:

P(X1 = 1, X2 = 1, X3 = 0) = P(X1 = 1)P(X2 = 1)P(X3 = 0) = p²(1 − p).

The three different possibilities are disjoint from each other. Thus,

P(Z = 2) = P(A ∪ B ∪ C) = P(A) + P(B) + P(C) = 3p²(1 − p).

Here 3 is the number of realizations where we have exactly two ones within the first three coin tosses. This is equal to the number of different ways there are to choose two different objects out of three items. In other words, the number three stands in our formula for 3


choose 2. We can now generalize to n trials and a number k ≤ n. There are n choose k possible outcomes for which among the first n coin tosses there appear exactly k ones. Each of these outcomes has probability:

p^k (1 − p)^{n−k}.

This gives then:

P(Z = k) = (n choose k) p^k (1 − p)^{n−k}.

    9.3 Geometric random variable

Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently many times. Let Xi designate the result of the i-th coin toss. Let T designate the number of trials it takes until we first observe a 1. For example, if we have

X1 = 0, X2 = 0, X3 = 1,

we would have that T = 3. If we observe on the other hand

X1 = 0, X2 = 1,

we have that T = 2. T is a random variable. As we are going to show, we have:

For k > 0, we have P(T = k) = p(1 − p)^{k−1}.
E[T] = 1/p
VAR[T] = (1 − p)/p²

A random variable T for which P(T = k) = p(1 − p)^{k−1}, k ∈ N, is called a geometric random variable with parameter p. Let us next prove the above statements. For T to be equal to k, we need to observe k − 1 times a zero followed by a one. Thus:

P(T = k) = P(X1 = 0, X2 = 0, . . . , X_{k−1} = 0, Xk = 1)
= P(X1 = 0) · P(X2 = 0) · . . . · P(X_{k−1} = 0) · P(Xk = 1) = (1 − p)^{k−1} p.

Let us calculate the expectation of T. We find:

E[T] = Σ_{k=1}^{∞} k p(1 − p)^{k−1}.

Let f(x) be the function:

f : x ↦ f(x) = Σ_{k=1}^{∞} k x^{k−1}.

We have that

f(x) = Σ_{k=1}^{∞} d(x^k)/dx = d(Σ_{k=1}^{∞} x^k)/dx = d(x/(1 − x))/dx = 1/(1 − x) + x/(1 − x)² = 1/(1 − x)².   (9.1)

This shows that:

Σ_{k=1}^{∞} k(1 − p)^{k−1} = f(1 − p) = 1/p².

Thus,

E[T] = p · (Σ_{k=1}^{∞} k(1 − p)^{k−1}) = p · (1/p²) = 1/p.

Let us next calculate the variance of a geometric random variable. We find:

E[T²] = Σ_{k=1}^{∞} k² p(1 − p)^{k−1}.

Let g(.) be the map:

g : x ↦ g(x) = Σ_{k=1}^{∞} k² x^{k−1}.

We find:

g(x) = Σ_{k=1}^{∞} k · d(x^k)/dx = d(x Σ_{k=1}^{∞} k x^{k−1})/dx.

Using equation 9.1, we find:

g(x) = d(x/(1 − x)²)/dx = (1 + x)/(1 − x)³.

This implies that

E[T²] = p · g(1 − p) = (2 − p)/p².

Now,

VAR[T] = E[T²] − (E[T])² = (2 − p)/p² − (1/p)² = (1 − p)/p².

    10 Continuous random variables

So far we have only been studying discrete random variables. Let us see how continuous random variables are defined.

    34

Definition 10.1 Let X be a number generated by a random experiment. (Such a random number is also called a random variable.) X is a continuous random variable if there exists a non-negative piecewise continuous function

f : R → R+, x ↦ f(x),

such that for any interval I = [i1, i2] ⊂ R we have that:

P(X ∈ I) = ∫_I f(x) dx.

The function f(.) is called the density function of X, or simply the density of X.

Note that the notation ∫_I f(x) dx stands for:

∫_I f(x) dx = ∫_{i1}^{i2} f(x) dx.   (10.1)

Recall also that integrals like the one appearing in equation 10.1 are defined to be equal to the area under the curve f(.) and above the interval I.

Remark 10.1 Let f(.) be a piecewise continuous function from R into R. Then there exists a continuous random variable X such that f(.) is the density of X if and only if all of the following conditions are satisfied:

1. f is everywhere non-negative.

2. ∫_R f(x) dx = 1.

    Let us next give some important examples of continuous random variables:

    The uniform variable in the interval I = [i1, i2], where i1 < i2. The density of f(.)is equal to 1/|i2 i2| everywhere in the interval I. Anywhere outside the interval I,f(.) is equal to zero.

    The standard normal variable has density:

    f(x) :=12

    ex2/2.

A standard normal random variable is often denoted by N(0, 1).

Let μ ∈ R, σ > 0 be given numbers. The density of the normal variable with expectation μ and standard deviation σ is defined to be equal to:

f(x) := (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}.


11 Normal random variables

The distribution of a normal random variable is given by the probability density:

f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)}.

Hence there are two parameters which determine a normal distribution: μ and σ. We write N(μ, σ) for a normal variable with parameters μ and σ.
If we analyze the density function f(x), we see that for any value a, we have that f(μ + a) = f(μ − a). Hence the function f(.) is symmetric at the point μ. This implies that the expected value has to be μ:

E[N(μ, σ)] = μ.

One could also show this by verifying that

E[N(μ, σ)] − μ = ∫_{−∞}^{+∞} (x − μ) · (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} dx = 0.

By integration by parts, one can show that

∫_{−∞}^{+∞} (x − μ)² · (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)} dx = σ²

and hence the variance is σ²:

VAR[N(μ, σ)] = σ².

Note that the function f(x) decreases when we go away from μ: the shape is a bell shape with maximum at μ and width σ. (Go onto the Internet to see the graph of a normal density plotted.)

    Let us give next a few very useful facts about normal variables:

Let a and b be two constants such that a ≠ 0. Let X be a normal variable with parameters μ_X and σ_X. Let Y be the random variable defined by affine transformation from X in the following way: Y := aX + b. Then Y is also normal. The parameters of Y are

E[Y] = μ_Y = aμ_X + b

and

σ_Y = |a|σ_X.

This we obtain simply from the fact that these parameters are the expectation and standard deviation of their respective variables.

Let X and Y be normal variables independent of each other. Let Z := X + Y. Then Z is also normal. The same result is true for sums of more than two independent normal variables.


If X is normal, then

Z := (X − E[X]) / √(VAR[X])

is a standard normal.

For the last point above note that for any random variable X (not necessarily normal) we have that if Z = (X − E[X])/σ_X, then Z has expectation zero and standard deviation 1. This is a simple, straightforward calculation:

E[Z] = E[ (X − E[X])/σ_X ] = (1/σ_X) (E[X] − E[E[X]]) (11.1)

but since E[E[X]] = E[X], equality 11.1 implies that E[Z] = 0. Also

VAR[Z] = VAR[ (X − E[X])/σ_X ] = VAR[X]/σ_X² = 1.

Now if X is normal then we saw that Z = (X − E[X])/σ_X is also normal, since Z is just obtained from X by multiplying and adding constants. But Z has expectation 0 and standard deviation 1 and hence it is standard normal.
One can use normal variables to model financial processes and many others. Let us consider an example. Assume that a portfolio consists of three stocks. Let Xi denote the value of stock number i one year from now. We assume that the three stocks in the portfolio are all independent of each other and normally distributed, so that μ_i = E[Xi] and σ_i = √(VAR[Xi]) for i = 1, 2, 3. Let

μ_1 = 100, μ_2 = 110, μ_3 = 120

and let

σ_1 = 10, σ_2 = 20, σ_3 = 20.

The value of the portfolio after one year is Z = X1 + X2 + X3 and E[Z] = E[X1] + E[X2] + E[X3] = 330.
Question: What is the probability that the value of the portfolio after a year is above 360?
Answer: We have that

VAR[Z] = VAR[X1] + VAR[X2] + VAR[X3] = 100 + 400 + 400 = 900

and hence

σ_Z = √(VAR[Z]) = 30.

We are now going to calculate

P(Z ≥ 360).

For this we want to transform the probability into a probability involving a standard normal, since for the standard normal we have tables available. We find

P(Z ≥ 360) = P( (Z − E[Z])/σ_Z ≥ (360 − E[Z])/σ_Z ). (11.2)


Note that

(360 − E[Z])/σ_Z = 1

and also (Z − E[Z])/σ_Z is standard normal. Using this in equation 11.2, we find that the probability that the portfolio after a year is above 360 is equal to

P(Z ≥ 360) = P(N(0, 1) ≥ 1) = 1 − Φ(1),

where Φ(1) = P(N(0, 1) ≤ 1) = 0.8413 can be found in a table for the standard normal.
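Instead of a table, the probability can also be computed with the error function, since Φ(z) = (1 + erf(z/√2))/2. Here is a minimal sketch for the portfolio example; the use of math.erf is our choice, the notes only assume a table.

```python
import math

def phi(z):
    """Standard normal distribution function Φ(z) = P(N(0,1) <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Portfolio example: E[Z] = 330, sigma_Z = 30, threshold 360.
mu_Z, sigma_Z = 330.0, 30.0
threshold = 360.0
z = (threshold - mu_Z) / sigma_Z   # standardized threshold: (360 - 330)/30 = 1
p_above = 1.0 - phi(z)             # P(Z >= 360) = 1 - Φ(1)
print(round(z, 4), round(p_above, 4))  # prints 1.0 0.1587
```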

12 Distribution functions

Definition 12.1 Let X be a random variable. The distribution function FX : R → R of X is defined in the following way:

FX(s) := P(X ≤ s)

for all s ∈ R.
Let us next mention a few properties of the distribution function:

FX is an increasing function. This means that for any two numbers s < t in R, we have that FX(s) ≤ FX(t).

lim_{s→∞} FX(s) = 1

lim_{s→−∞} FX(s) = 0

We leave the proof of the three facts above to the reader.
Imagine next that X is a continuous random variable with density function fX. Then, we have for all s ∈ R, that:

FX(s) = P(X ≤ s) = ∫_{−∞}^{s} fX(t) dt.

Taking the derivative on both sides of the above equation, we find that:

dFX(s)/ds = fX(s).

In other words, for a continuous random variable X, the derivative of the distribution function is equal to the density of X. Hence, in this case, the distribution function is differentiable and thus also continuous. Another implication is: the distribution function uniquely determines the density function fX. This implies that the distribution function determines uniquely all the probabilities of events which can be defined in terms of X.

    Assume next that the random variable X has a finite state space:

    X = {s1, s2, . . . , sr}


such that s1 < s2 < . . . < sr. Then, the distribution function FX is a step function. Left of s1, we have that FX is equal to zero. Right of sr it is equal to one. Between si and si+1, that is on the interval [si, si+1[, the distribution function is constantly equal to:

Σ_{j≤i} P(X = sj).

(This holds for all i between 1 and r − 1.)
To sum up: for continuous random variables the distribution functions are differentiable functions, whilst for discrete random variables the distribution functions are step functions. Let us next show how we can use the distribution function to simulate random variables. The situation is the following: our computer can generate a uniform random variable U in the interval [0, 1]. (This is a random variable with density equal to 1 in [0, 1] and 0 everywhere else.) We want to generate a random variable with a given probability density function fX, using U. We do this in the following manner: we plug the random number U into the map invFX. (Here invFX designates the inverse map of FX(.).) The next lemma says that this method really produces a random variable with the desired density function.

Lemma 12.1 Let fX denote the density function of a continuous random variable and let FX designate its distribution function. Let Y designate the random variable obtained by plugging the uniform random variable U into the inverse distribution function:

Y := invFX(U).

Then, the density of Y is equal to fX.

Proof. Since FX(.) is an increasing function, for any number s we have that

Y ≤ s

is equivalent to

FX(Y) ≤ FX(s).

Hence:

P(Y ≤ s) = P(FX(Y) ≤ FX(s)).

Now, FX(Y) = U, thus

P(Y ≤ s) = P(U ≤ FX(s)).

We know that FX(s) ∈ [0, 1]. Using the fact that U has density function equal to one in the interval [0, 1], we find:

P(U ≤ FX(s)) = ∫_0^{FX(s)} 1 dt = FX(s).

Thus

P(Y ≤ s) = FX(s).


This shows that the distribution function FY of Y is equal to FX. Taking the derivative with respect to s of both FY(s) and FX(s) yields:

fY(s) = fX(s).

Hence, X and Y have the same density function. This finishes the proof.

13 Expectation and variance for continuous random variables

Definition 13.1 Let X be a continuous random variable with density function fX(.). Then, we define the expectation E[X] of X to be:

E[X] := ∫_{−∞}^{+∞} s fX(s) ds.

Next we are going to prove that the law of large numbers also holds for continuous random variables.

Theorem 13.1 Let X1, X2, . . . be a sequence of i.i.d. continuous random variables all with the same density function fX(.). Then,

lim_{n→∞} (X1 + X2 + . . . + Xn)/n = E[X1].

Proof. Let δ > 0 be a fixed number. Let us approximate the continuous variable Xi by a discrete variable Xi^δ. For this we let Xi^δ be the largest integer multiple of δ which is still smaller than or equal to Xi. In this way, we always get that

|Xi^δ − Xi| < δ.

This implies that:

| (X1 + X2 + . . . + Xn)/n − (X1^δ + X2^δ + . . . + Xn^δ)/n | < δ.

However, the variables Xi^δ are discrete. So for them the law of large numbers has already been proven and we find:

lim_{n→∞} (X1^δ + X2^δ + . . . + Xn^δ)/n = E[X1^δ]. (13.1)

We have that

E[Xi^δ] = Σ_{z∈Z} zδ · P(Xi^δ = zδ).


However, by definition:

P(Xi^δ = zδ) = P(Xi ∈ [zδ, (z + 1)δ[).

The expression on the right side of the last equality is equal to

∫_{zδ}^{(z+1)δ} fX(s) ds.

Thus

E[Xi^δ] = Σ_{z∈Z} zδ ∫_{zδ}^{(z+1)δ} fX(s) ds.

As δ tends to zero, the expression on the left side of the last equality above tends to:

∫_{−∞}^{+∞} s fX(s) ds.

This implies that by taking δ fixed and sufficiently small, we have that, for large enough n, the fraction

(X1 + X2 + . . . + Xn)/n

is as close as we want to

∫_{−∞}^{+∞} s fX(s) ds.

This implies that

(X1 + X2 + . . . + Xn)/n

actually converges to

∫_{−∞}^{+∞} s fX(s) ds.
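Theorem 13.1 can be illustrated by simulation. In the sketch below (our own example, not from the notes) the Xi are uniform on [0, 1], so E[X1] = ∫ s fX(s) ds = 1/2, and the running average approaches 1/2:

```python
import random

def running_mean(n, seed=0):
    """Average of n i.i.d. uniform [0,1] variables; by Theorem 13.1 this
    approaches E[X1] = integral of s * 1 over [0,1] = 0.5 as n grows."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n

if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, running_mean(n))  # the averages settle near 0.5
```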

The linearity of expectation holds in the same way as for discrete random variables. This is the content of the next lemma.

Lemma 13.1 Let X and Y be two continuous random variables and let a be a number. Then

E[X + Y] = E[X] + E[Y]

and

E[aX] = aE[X].

Proof. The proof goes like in the discrete case: the only thing used for the proof in the discrete case is the law of large numbers. Since the law of large numbers also holds in the continuous case, exactly the same proof holds for the continuous case.


14 Central limit theorem

The Central Limit Theorem (CLT) is one of the most important theorems in probability. Roughly speaking, it says that if we build the sum of many independent random variables, no matter what these little contributions are, we will always get approximately a normal distribution. This is very important in everyday life, because you often have situations where a lot of little independent things add up, so you end up observing something which is approximately a normal random variable. For example, when you make a measurement you are most of the time in this situation, that is, when you don't make one big measurement error. In that case, you have a lot of little imprecisions which add up to give you your measurement error. Most of the time, these imprecisions can be seen as close to being independent of each other. This then implies: unless you make one big error, your measurement error will always end up close to a normal variable.
Let X1, X2, X3, . . . be a sequence of independent, identically distributed random variables. (This means that they are the outcome of the same random experiment repeated several times independently.) Let μ denote the expectation μ := E[X1] and let σ denote the standard deviation σ := √(VAR[X1]). Let Z denote the sum

Z := X1 + X2 + X3 + . . . + Xn.

Then, by the calculation rules we learned for expectation and variance, it follows that:

E[Z] = nμ

and the standard deviation σ_Z of Z is equal to:

σ_Z = √n · σ.

When you subtract from a random variable its mean and divide by the standard deviation, you always get a new variable with zero expectation and variance equal to one. Thus the standardized sum

(Z − nμ)/(√n σ)

has expectation zero and standard deviation 1. The central limit theorem says that, on top of this, for large n the expression

(Z − nμ)/(√n σ)

is close to being a standard normal variable. Let us now formulate the central limit theorem:

Theorem 14.1 Let X1, X2, X3, . . .


be a sequence of independent, identically distributed random variables. Then we have that for large n, the normalized sum Y:

Y := (X1 + . . . + Xn − nμ)/(√n σ)

is close to being a standard normal random variable.

This version of the Central Limit Theorem is not yet very precise. As a matter of fact, what does "close to being a standard normal random variable" mean? We certainly understand what it means for two points to be close to each other. But we have not yet discussed the concept of closeness for random variables. Let us do this by using the example of a six-sided die. Let us assume that we have a six-sided die which is not perfectly symmetric. For i ∈ {1, 2, . . . , 6}, let pi denote the probability of side i:

P(X = i) = pi,

where X denotes the number which we get when we throw this die once. A perfectly symmetric die would have the probabilities pi all equal to 1/6. Say our die is not exactly symmetric but close to a perfectly symmetric die. What does this mean? This means that for all i ∈ {1, 2, . . . , 6} we have that pi is close to 1/6.
For the die example we have a finite number of outcomes. For a continuous random variable, on the other hand, we are interested in the probabilities of intervals. By this I mean that for a given interval I, we are interested in the probability that the random experiment gives a result in I. If X denotes our continuous random variable, this means that we are interested in probabilities of the type:

P(X ∈ I).

We are now ready to explain what we mean by: two continuous random variables X and Y have their probability laws close to each other. By "X and Y are close" (have probability laws which are close to each other) we mean: for each interval I, the real number P(Y ∈ I) is close to the real number P(X ∈ I). For the interval I = [i1, i2] with i1 < i2, we have that

P(X ∈ I) = P(X ≤ i2) − P(X < i1).

It follows that if we know all the probabilities for semi-infinite intervals, we can determine the probabilities of type P(X ∈ I). Thus, for two continuous random variables X and Y to be close to each other (with respect to their probability law), it is enough to ask that for all x ∈ R the real number P(X ≤ x) is close to the real number P(Y ≤ x).
Now that we have clarified the concept of closeness in distribution for continuous random variables, we are ready to formulate the CLT in a more precise way. Hence saying that

Z := (X1 + . . . + Xn − nμ)/(√n σ)


is close to a standard normal random variable N(0, 1) means that for every z ∈ R we have that

P(Z ≤ z)

is close to

P(N(0, 1) ≤ z).

In other words, as n goes to infinity, P(Z ≤ z) converges to P(N(0, 1) ≤ z). Let us give a more precise version of the CLT than what we have done so far:

Theorem 14.2 Let

X1, X2, X3, . . .

be a sequence of independent, identically distributed random variables. Let E[X1] = μ and VAR[X1] = σ². Then, for any z ∈ R, we have that:

lim_{n→∞} P( (X1 + . . . + Xn − nμ)/(√n σ) ≤ z ) = P(N(0, 1) ≤ z).

15 Statistical testing

Let us first give an example. Assume that you read in the newspaper that 50% of the population in Atlanta smokes. You don't believe that number, so you start a survey. You ask 100 randomly chosen people, and find that 70 out of the hundred smoke. Now you want to know if the result of your survey constitutes strong evidence against the 50% claimed by the newspaper.
If the true percentage of the population of Atlanta which smokes were 50%, you would expect to find in your survey a number closer to 50 people. However, it could be that although the true percentage is 50%, you still observe a figure as high as 70, just by chance. So the procedure is the following: determine the probability of getting 70 people or more in your survey who smoke, given that the percentage is really 50%. If that probability is very small, you decide to reject the idea that 50% of the population in Atlanta smokes. In general one takes a fixed level α > 0 and rejects the idea one wants to test if the probability is smaller than α. Most of the time statisticians work with α equal to 0.05 or 0.1. So, if the probability of getting 70 people or more in our survey who smoke (given that 50% of the population smokes) is smaller than α = 0.05, then statisticians will say: we reject the hypothesis that 50% of the population in Atlanta smokes. We do this on the confidence level α = 0.05, based on the evidence of our survey.
How do we calculate the probability of observing 70 or more people in our survey who smoke if the percentage really is 50% of the Atlanta population? For this it is important how we choose the people for our survey. The correct way to choose them is the following: take a complete list of the inhabitants of Atlanta. Number them. Choose 100 of them with replacement and with equal probability. This means that a person could appear twice.
Let Xi be equal to one if the i-th person chosen is a smoker. Then, if we chose the people


following the procedure above, we find that the Xi's are i.i.d. and that P(Xi = 1) = p, where p designates the true percentage of people in Atlanta who smoke. Then also E[Xi] = p. The total number of people in our survey who smoke, Z, can now be expressed as

Z := X1 + X2 + . . . + X100.

Let P50%(.) designate the probability given that the true percentage which smokes is really 50%. Testing if 50% in Atlanta smoke can now be described as follows:

Calculate the probability:

P50%(X1 + . . . + X100 ≥ 70).

If the above probability is smaller than α = 0.05, we reject the hypothesis that 50% of the population smokes in Atlanta (we reject it on the α = 0.05 level). Otherwise, we keep the hypothesis. When we keep the hypothesis, this means that the result of our survey does not constitute strong evidence against the hypothesis: the result of the survey does not contradict the hypothesis.

Note that we could also have done the test on the α = 0.1 level. In that case we would reject the hypothesis if that probability is smaller than 0.1.
Next we explain how we can calculate the probability P50%(Z ≥ 70) approximately, using the CLT. Simply note that, by basic algebra, the inequality

Z ≥ 70

is equivalent to

Z − nμ ≥ 70 − nμ,

which is itself equivalent to:

(Z − nμ)/(√n σ) ≥ (70 − nμ)/(√n σ).

Equivalent inequalities must also have the same probability. Hence:

P50%(Z ≥ 70) = P50%(Z − nμ ≥ 70 − nμ) = P50%( (Z − nμ)/(√n σ) ≥ (70 − nμ)/(√n σ) ). (15.1)

By the CLT we have that

(Z − nμ)/(√n σ)

is close to being a standard normal random variable N(0, 1). Thus, the probability on the right side of equation 15.1 is approximately equal to

P( N(0, 1) ≥ (70 − nμ)/(√n σ) ). (15.2)

If the probability in expression 15.2 is smaller than 0.05, then we reject the hypothesis that 50% of the Atlanta population smokes (on the α = 0.05 level). We can look up the probability that the standard normal N(0, 1) exceeds the number (70 − nμ)/(√n σ) in a table. We have tables for the standard normal variable N(0, 1).
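For the survey example the computation can be carried out explicitly. Under the 50% hypothesis each Xi is a Bernoulli(1/2) variable, so μ = 1/2 and σ = √(1/2 · 1/2) = 1/2, and (70 − nμ)/(√n σ) = (70 − 50)/5 = 4. Here is a sketch, using math.erf in place of a table (our choice):

```python
import math

def phi(z):
    """Standard normal Φ(z) = P(N(0,1) <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Survey example under the 50% hypothesis: each Xi is Bernoulli(0.5),
# so mu = 0.5 and sigma = sqrt(0.5 * 0.5) = 0.5, with n = 100 and 70 smokers observed.
n, mu, sigma, observed = 100, 0.5, 0.5, 70
c = (observed - n * mu) / (math.sqrt(n) * sigma)   # (70 - 50)/5 = 4
p_value = 1.0 - phi(c)                             # approx P_50%(Z >= 70) by the CLT
print(c, p_value, p_value < 0.05)                  # tiny p-value: reject the hypothesis
```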


15.1 Looking up probabilities for the standard normal in a table

Let z ∈ R. Let Φ(z) denote the probability that a standard normal variable is smaller than or equal to z. Thus:

Φ(z) := P(N(0, 1) ≤ z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx.

For example, let z > 0 be a number. Say we want to find the probability

P(N(0, 1) ≥ z). (15.3)

The table for the standard normal gives the values of Φ(z) for z > 0, thus we have to try to express probability 15.3 in terms of Φ(z). For this note that:

P(N(0, 1) ≥ z) = 1 − P(N(0, 1) < z).

Furthermore, P(N(0, 1) < z) is equal to P(N(0, 1) ≤ z) = Φ(z). Thus we find that:

P(N(0, 1) ≥ z) = 1 − Φ(z).

Let us next explain how, if z < 0, we can find the probability:

P(N(0, 1) ≤ z).

Note that N(0, 1) is symmetric around the origin. Thus,

P(N(0, 1) ≤ z) = P(N(0, 1) ≥ |z|).

This brings us back to the previously studied case. We find

P(N(0, 1) ≤ z) = 1 − Φ(|z|).

Eventually, let z > 0 again. What is the probability

P(−z ≤ N(0, 1) ≤ z)

equal to? For this problem note that

P(−z ≤ N(0, 1) ≤ z) = 1 − P(N(0, 1) ≤ −z) − P(N(0, 1) ≥ z).

Thus, we find that:

P(−z ≤ N(0, 1) ≤ z) = 1 − (1 − Φ(z)) − (1 − Φ(z)) = 2Φ(z) − 1.
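The identities above can be checked against each other numerically. The sketch below (our own, with Φ computed via the error function rather than a table) verifies that P(N(0, 1) ≥ z) = P(N(0, 1) ≤ −z) and that 2Φ(z) − 1 = 1 − 2(1 − Φ(z)):

```python
import math

def phi(z):
    """Standard normal distribution function Φ(z), computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

z = 1.64
upper_tail = 1.0 - phi(z)              # P(N(0,1) >= z) = 1 - Φ(z)
left_of_minus_z = 1.0 - phi(abs(-z))   # P(N(0,1) <= -z) = 1 - Φ(|z|), by symmetry
two_sided = 2.0 * phi(z) - 1.0         # P(-z <= N(0,1) <= z) = 2Φ(z) - 1
print(upper_tail, left_of_minus_z, two_sided)
```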


15.2 Two sample testing

Let us give an example to introduce this subject. Assume that we are testing a new fuel for a certain type of rocket. We would like to know if the new fuel gives a different initial velocity to the rocket. The initial velocity with the old fuel is denoted by X, whilst Y is the initial velocity with the new fuel. We fire the rocket five times with the old fuel and measure the initial velocity each time. We find:

X1 = 100, X2 = 102, X3 = 97, X4 = 100, X5 = 101 (15.4)

(here Xi denotes the initial velocity measured whilst firing the rocket for the i-th time with the old fuel). Then we fire the rocket five times with the new fuel. Every time we measure the initial velocity. We find

Y1 = 101, Y2 = 103, Y3 = 99, Y4 = 102, Y5 = 100 (15.5)

We calculate the averages:

X̄ := (X1 + X2 + X3 + X4 + X5)/5 = 100

and

Ȳ := (Y1 + Y2 + Y3 + Y4 + Y5)/5 = 101.

When we measure the initial velocities we find different values even when we use the same fuel. The reason is that our measurement instruments are not very precise, so we get the true value plus a measurement error. The model is as follows:

Xi = μX + ε^X_i

and

Yi = μY + ε^Y_i.

Furthermore ε^X_1, ε^X_2, . . . are i.i.d. random errors and so are ε^Y_1, ε^Y_2, . . .. We assume that the measurement instrument is well calibrated so that

E[ε^X_i] = E[ε^Y_i] = 0

for all i = 1, 2, . . .. Here μX and μY are unknown constants (in our example μX is the initial speed when we use the old fuel whilst μY is the initial speed when we use the new fuel). We find that

E[Xi] = E[μX + ε^X_i] = E[μX] + E[ε^X_i] = μX + 0 = μX,

and similarly

E[Yi] = μY.


So our testing problem can be described as follows: we want to figure out, based on our data 15.4 and 15.5, whether the second fuel gives a different initial speed than the old fuel. We observed

Ȳ − X̄ = 1 > 0.

This means that in the second sample, obtained with the new fuel, the initial speed is on average higher by one unit than the initial speed in the first sample obtained with the old fuel. But is this evidence enough to conclude that the new fuel provides higher initial speed, or could this difference just be due to the measurement errors? As a matter of fact, since we make measurement errors, it could be that even if the second fuel does not provide higher initial speed (i.e. μX = μY), due to the random errors and bad luck the second average is higher than the first. In our present setting we can never be absolutely sure, but we try to see if there is statistically significant evidence for arguing that μX and μY are not equal.
The exact method to do this depends on whether we know the standard deviation of the errors or not, and on whether they are identical for the two samples. We will need the expectation and standard deviation of the means. This is what we calculate in the next paragraph.

Expectation and standard deviation of the means Let the standard deviations of the errors be denoted by

σX := √(VAR[ε^X_i]), σY := √(VAR[ε^Y_i]).

Let Z := Ȳ − X̄. We find that the standard deviation of Z is given by

σZ = σ_{Ȳ−X̄} = √(VAR[Ȳ − X̄]) = √(VAR[Ȳ] + VAR[X̄]), (15.6)

where the last equality above was obtained using the facts that the X's and the Y's are independent of each other, and that the variance of a sum of independent variables is equal to the sum of the variances. Now

VAR[X̄] = VAR[ (X1 + . . . + Xn)/n ] = VAR[X1 + . . . + Xn]/n²
= (VAR[X1] + VAR[X2] + . . . + VAR[Xn])/n² = nVAR[X1]/n² = VAR[X1]/n = σX²/n

and similarly

VAR[Ȳ] = σY²/n.

Using this in equation 15.6, we find

σZ = σ_{Ȳ−X̄} = √( σX²/n + σY²/n ). (15.7)


If σX = σY (which should be the case when we use the same measurement instrument), then equation 15.7 can be rewritten as

σ_{Ȳ−X̄} = √( σ²/n + σ²/n ) = σ√(2/n), (15.8)

where σ = σX = σY. If the two samples had different sizes, we would find by a similar calculation

σZ = √( σX²/n1 + σY²/n2 ), (15.9)

where n1 is the size of the first sample and n2 is the size of the second sample. Furthermore, we have for the expectation

E[Ȳ − X̄] = E[Ȳ] − E[X̄]
= E[ (Y1 + Y2 + . . . + Yn)/n ] − E[ (X1 + . . . + Xn)/n ]
= E[Y1 + . . . + Yn]/n − E[X1 + . . . + Xn]/n
= (E[Y1] + E[Y2] + . . . + E[Yn])/n − (E[X1] + E[X2] + . . . + E[Xn])/n
= E[Y1] − E[X1] = μY − μX.

To summarize, we found that

E[Ȳ − X̄] = μY − μX. (15.10)
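With equations 15.8 and 15.10 we can already standardize the observed difference for the fuel data. The sketch below does this under the assumption that the error standard deviation σ is known; the value σ = 1.5 is a hypothetical choice of ours, since the notes have not specified σ at this point.

```python
import math

# Fuel data from equations 15.4 and 15.5.
X = [100, 102, 97, 100, 101]   # old fuel
Y = [101, 103, 99, 102, 100]   # new fuel

n = len(X)
x_bar = sum(X) / n             # sample mean of X, here 100.0
y_bar = sum(Y) / n             # sample mean of Y, here 101.0

# Assumed known error standard deviation (hypothetical value for illustration).
sigma = 1.5
sigma_diff = sigma * math.sqrt(2.0 / n)   # equation 15.8: sigma * sqrt(2/n)
z = (y_bar - x_bar) / sigma_diff          # standardized observed difference
print(y_bar - x_bar, sigma_diff, z)
```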

A simplified method Let us first explain a rough method, so as to convey the idea in a simple way. Up to a small detail, this method is the same as what is really used in practice.

At this stage we are ready to explain how we could proceed to see whether we have strong evidence for the case μY − μX ≠ 0. We are going to use the rule of thumb which says that in most cases, for most variables, the values we typically observe are