MAST20004 Course Notes 2014


  • Administration. Web site: accessible through the Learning Management System. Lectures. Consultation hours. Tutorial/Computer Lab classes. Homework. Assessment.

    Slide 1

  • Probability is interesting: the Monty Hall Game, the Birthday Problem, the Bus-Stop Paradox.

    Slide 2

  • The Monty Hall Game. A prize lies behind one of three doors. The contestant chooses a door. Monty Hall (who knows which door the prize is behind) opens a door not chosen by the contestant that does not have the prize behind it. There must be at least one such door.

    Monty Hall then offers the contestant the option of changing his/her original selection to the other unopened door.

    Should the contestant change?

    Slide 3
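A minimal simulation sketch (an editorial addition, not part of the original slides) that estimates the win probability of the "stay" and "switch" strategies; it suggests that the contestant should change.

```python
import random

def monty_hall(switch, trials=100_000):
    """Estimate the probability of winning the prize under a fixed strategy."""
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)             # door hiding the prize
        choice = random.randrange(3)            # contestant's initial choice
        # Monty opens a door that is neither the contestant's choice nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            # switch to the remaining unopened door
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print("stay  :", monty_hall(switch=False))   # close to 1/3
print("switch:", monty_hall(switch=True))    # close to 2/3
```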

  • The Birthday Problem

    Twenty-three people are on a soccer pitch. What is the probability that there are two people present with the same birthday?

    Slide 4
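A short check of the answer (added here, assuming 365 equally likely birthdays and ignoring leap years):

```python
def birthday_match_prob(n=23, days=365):
    # P(no shared birthday) = (365/365) * (364/365) * ... * ((365 - n + 1)/365)
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (days - i) / days
    return 1 - p_no_match

print(round(birthday_match_prob(23), 4))   # about 0.5073
```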

  • The Bus-Stop Paradox. Buses on a particular route arrive at randomly-spaced intervals throughout the day.

    On average a bus arrives every hour. A passenger comes to the bus-stop at a random instant. What is the expected length of time that the passenger will have to wait for a bus?

    Slide 5
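A simulation sketch (an addition; it assumes the gaps between buses are independent exponential random variables with mean one hour, which the slide does not specify) illustrating the paradox: the average wait is close to a full hour, not half an hour.

```python
import bisect
import random

def average_wait(n_buses=200_000, n_passengers=10_000, mean_gap=1.0):
    # Bus arrival times: cumulative sums of independent exponential gaps (mean 1 hour).
    arrivals = []
    t = 0.0
    for _ in range(n_buses):
        t += random.expovariate(1 / mean_gap)
        arrivals.append(t)
    # Passengers arrive at uniformly random instants; record the time to the next bus.
    total_wait = 0.0
    for _ in range(n_passengers):
        s = random.uniform(0, arrivals[-1])
        i = bisect.bisect_left(arrivals, s)      # index of the first bus at or after time s
        total_wait += arrivals[i] - s
    return total_wait / n_passengers

print(average_wait())   # close to 1.0, not 0.5
```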

  • Probability. There are many applications of probability in society. For example, we need to use probability theory to

    understand gambling games, design and analyse experiments in almost any field of science and social science,

    assign values to financial derivatives, design, dimension and control telecommunications systems, and understand the process of evolution of gene sequences.

    Slide 6

  • Random Experiments (Ghahramani 1.2). In many situations we perform an experiment that has the following characteristics:

    There is a number (which may be infinite) of possible outcomes of the experiment.

    The actual outcome that occurs depends on influences that cannot be predicted beforehand.

    Slide 7

  • Examples: Toss of a coin or die. Spin of a roulette wheel. A horse race. Measurement of the number of phone calls passing through a telephone exchange in a fixed time period.

    A record of the proportion of people in a survey who approve of the prime minister.

    An observation of whether the greater bilby (Macrotis lagotis) is extinct in 100 years' time.

    Slide 8

  • Outcome Spaces. Although we cannot predict the outcome of a random experiment with certainty, we usually know the set of all possible outcomes.

    Definition: The outcome space or sample space of a random experiment is the set of all its possible outcomes.

    NB: Note that Ghahramani uses S to denote the sample space but we will use the more common Ω.

    Slide 9

  • Examples. Toss of a coin.

    Ω = {H, T} where H = head up, T = tail up.

    Spin of a roulette wheel.

    Ω = {0, 1, . . . , 36}. (There are 37 numbers on an Australian roulette wheel.)

    Slide 10

  • A horse race.

    Here the actual experiment needs to be defined more precisely. If we observe only the winner we might take

    Ω = {all horses in the race}, since the winner has to be one of the horses. If we observe the placings we could take

    Ω = {all possible ordered sets of 3 horses in the race}.

    Slide 11

  • More generally, if we observe the whole race we might take

    Ω = {all possible finishing orders} or

    Ω = {all possible finishing orders together with times}

    or even Ω = {all possible films of the race}.

    This example illustrates that a given physical situation can lead to different random experiments depending on what we choose to observe. Furthermore, if we have a complex sample space then we can still infer simpler outcomes. We shall discuss this in more detail when we come to random variables.

    Slide 12

  • Some Further Examples. Some further examples are:

    A coin is tossed until a head occurs and the number of tosses required is observed: Ω = {1, 2, 3, . . .}.

    A machine automatically fills a one litre bottle with fluid, and the actual quantity of fluid in the bottle is measured in litres: Ω = {q : 0 ≤ q ≤ 1}.

    A car is filled up with petrol and then driven until it runs out; the distance it travels is measured in kilometres: Ω = {d : 0 ≤ d < ∞}.

    Slide 13

  • Exercise: Write down appropriate sample spaces for the experiments:

    Measurement of the number of phone calls passing through a telephone exchange in a fixed time period.

    A record of the proportion of people in a survey who approve of the prime minister.

    An observation of whether the greater bilby (Macrotis lagotis) is extinct in 100 years' time.

    Slide 14

  • Events. Frequently we are not interested in a single outcome, but whether or not one of a set of outcomes occurs. For example, we may have bet on red in roulette. We don't care which red number comes up; we just want one of them to. This motivates us to define the concept of an event.

    An event is a set of possible outcomes, that is, a subset of Ω. We say that the event A occurs if the observed outcome of the random experiment is one of the outcomes that is in the set A.

    Slide 15

  • Examples. Toss of a die.

    A = {2, 4, 6} is the event that the number on the die is even.

    Spin of a roulette wheel.

    A = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36} is the event that red happens.

    B = {1, 2, 3} is the event that one of the first three numbers occurs.

    D = {0} is the event that the number 0 comes up.

    Slide 16

  • Events

    Since Ω is a set of outcomes, Ω itself is an event. This is known as the certain event. One of the outcomes in Ω must occur.

    The empty set ∅ is also an event, known as the impossible event.

    Slide 17

  • Events are sets and so they are subject to the normal set operations. Thus

    The event A ∪ B is the event that A or B or both occur. The event A ∩ B is the event that A and B both occur. The event Aᶜ is the event that A does not occur. We write ω ∈ A to say that the outcome ω is in the event A. We write A ⊆ B to say that A is a subset of B. This includes the possibility that A = B.

    If A is finite (which will often not be the case), we write #A for the number of elements of A.

    Slide 18

  • For illustrative purposes, and to gain intuition, the relationship between events is often depicted using a Venn diagram.

    Two events A1, A2 which have no outcomes in common (A1 ∩ A2 = ∅) are called mutually exclusive or disjoint events.

    Similarly, events A1, A2, . . . are disjoint if no two have outcomes in common, that is

    Ai ∩ Aj = ∅ for i ≠ j.

    Slide 19

  • Two events are exhaustive if they contain all possible outcomes between them,

    A1 ∪ A2 = Ω. Similarly, events A1, A2, . . . are exhaustive if their union is the whole sample space,

    ⋃_i Ai = Ω.

    Slide 20

  • Examples. 1. Since A ∩ Aᶜ = ∅, A and Aᶜ are disjoint. 2. Since A ∪ Aᶜ = Ω, A and Aᶜ are exhaustive. 3. Throw of a die. Let

    A = {1, 3, 5}, B = {2, 4, 6}, C = {1, 2, 4, 6}, D = {2, 4}. Then A and B are disjoint and exhaustive, A and C are exhaustive but not disjoint, and A and D are disjoint but not exhaustive.

    Slide 21

  • Set operations satisfy the distributive laws

    A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C), A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C),

    and De Morgan's laws

    (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ, (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ.

    Slide 22

  • Simulation. Simulation of random experiments is a tool which probabilists often use. It consists of performing the experiment on a computer, instead of in real life. This has many advantages.

    It enables us to try out multiple possibilities before going to the expense of building a system.

    It is possible to perform multiple repetitions of an experiment in a short time, so that precise estimates of the behaviour can be derived.

    It is possible to study behaviour of random experiments which are so complicated that it is hard for us to study them analytically.

    In our computer lab classes, we shall be using simulation.

    Slide 23

  • Defining Probability (Ghahramani 1.1). Up to now we have talked about ways of describing results of random experiments: an event A happens if the outcome of the experiment is in the set A. We haven't yet talked about ways of assigning a measure to the likelihood of an event happening.

    That is, we are yet to define what we mean by a probability.

    First let us think about some intuitive notions.

    Slide 24

  • What do we mean when we say "The probability that a toss of a coin will result in heads is 1/2"?

    An interpretation that is accepted by most people for practical purposes is that such statements are made based upon some information about relative frequencies. Our experience suggests that heads comes up about half the time in coin tosses, so we say the probability of a head is 1/2.

    Similar statements can be made about tossing dice, spinning roulette wheels, arrivals of phone calls in a given time period, etc.

    Slide 25

  • Hence it seems that we can think of a probability as a long term relative frequency. However there are problems with this interpretation. Consider the statement

    "The probability that horse X will win the Melbourne Cup this year is 1/21."

    A similar statement is

    "The probability that Macrotis lagotis will be extinct in 100 years is 1/100."

    Slide 26

  • Both of the above-mentioned experiments will be performed only once under unique conditions, so a repetitive relative frequency definition makes no sense.

    One way to think of probability in these experiments is that it reflects the odds at which a person is willing to bet on an event.

    Thus probability takes on a personal definition: my evaluation of a probability may not be the same as yours. This interpretation of probability is known as the Bayesian interpretation.

    Slide 27

  • The way that mathematicians approach such issues is to define precisely the system they are studying via sets of axioms, derive results in that system, and then real world interpretations can be made in individual situations whenever the axioms correspond well to the real world situation.

    The study of mathematical probability is like this. It is based on axioms, under which probabilities behave sensibly.

    Slide 28

  • The definition of what we mean by "sensibly" is remarkably simple.

    We arbitrarily assign the value 1 to be the probability of the certain event and require that the probability of any event be nonnegative.

    If A and B are disjoint events, then if A occurs then B can't, and vice versa. Thus we would expect that

    P(A ∪ B) = P(A) + P(B), where we have written P(A) to denote the probability of event A.

    Slide 29

  • Probability axioms/theorems (Ghahramani 1.3, 1.4). These considerations lead to the following axioms for mathematical probability:

    1. P(A) ≥ 0, for all events A.
    2. P(Ω) = 1.
    3. P(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai),

    where {A1, A2, A3, . . . , An} is a set of mutually exclusive events.

    Slide 30

  • In fact, it turns out that we need a slightly stronger version of Axiom 3. Specifically, it has to hold for infinite sequences of mutually exclusive events. Thus, we use

    3. P(⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai),

    where {A1, A2, A3, . . .} is any sequence of mutually exclusive events.

    Slide 31

  • The infinite version of Axiom 3 immediately implies finite additivity for any n, that is

    P(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai),

    where {A1, A2, A3, . . . , An} is a set of mutually exclusive events. We use infinite, rather than finite, additivity because we sometimes need to calculate probabilities for infinite unions.

    For example, the event that a 6 eventually occurs when tossing a die can be expressed as ⋃_{i=1}^{∞} Ai, where Ai is the event that the 6 occurs for the first time on the ith toss.

    Slide 32

  • From the axioms, we can deduce the following properties of the probability function:

    (4) P(∅) = 0, since A ∪ ∅ = A.
    (5) P(Aᶜ) = 1 − P(A), since A ∪ Aᶜ = Ω.
    (6) A ⊆ B ⟹ P(A) ≤ P(B), since A ∪ (Aᶜ ∩ B) = B.
    (7) P(A) ≤ 1, since A ⊆ Ω.
    (8) Addition theorem:

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

    Slide 33

  • Notes. P(·) is a set function. It maps 𝒜 → [0, 1], where 𝒜 denotes the class of events, that is, a set of subsets of the outcome space. In a second subject on probability, you will see that we can't always take 𝒜 to be the set of all subsets of the outcome space, and that 𝒜 has to satisfy certain rules. However, we won't worry about these issues in this subject.

    For a discrete outcome space, we can write P(A) = Σ_{ω ∈ A} P({ω}).

    In general, possible outcomes are allowed to have zero probability, thus P(E) = 0 does not imply E = ∅. Similarly, there can be sets other than Ω that can have probability 1.

    Slide 34

  • Evaluating Probabilities. So far we have said nothing about how numerical values are assigned to the probability function, just that if we assign values in such a way that the Axioms (1)-(3) hold, then the properties (4)-(8) will also hold.

    Assigning probabilities to events is a large part of what the subject is about. We are not going to cover it all today.

    However, to give us something concrete to think about, we shall mention the simplest case: when the outcome space is finite and each outcome is equiprobable.

    Slide 35

  • When each outcome is equiprobable and #(Ω) = N, it follows easily from the axioms that

    P({ω}) = 1/N for all ω ∈ Ω. Further,

    P(A) = #(A)/N.

    Slide 36

  • Examples: Toss of one fair coin. Toss of two fair coins. Toss of two fair dice. Three of twenty tyres in a store are defective. Four tyres are randomly selected for inspection. What is the probability that a defective tyre will be included?

    The birthday problem.

    Slide 37
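For the tyre example, a brute-force count over the equally likely samples (an added sketch, not part of the notes) gives the answer; analytically it is 1 − C(17,4)/C(20,4).

```python
from itertools import combinations

tyres = range(20)                       # label the tyres 0..19; take 0, 1, 2 as the defective ones
samples = list(combinations(tyres, 4))  # all C(20, 4) = 4845 equally likely selections
with_defective = sum(1 for s in samples if any(t < 3 for t in s))
print(with_defective / len(samples))    # 1 - C(17,4)/C(20,4) = 0.50877...
```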

  • Conditional Probability (Ghahramani 3.1)

    If A and H are two events, and it is known that event H has occurred, what effect does this information have on the probability of occurrence of A?

    Slide 38

  • Example. Toss two fair dice. If we know that the first die is a 3, what is the probability that the sum of the two dice is 8?

    The original sample space has 36 outcomes {(1, 1), (1, 2), . . . , (6, 6)}. Given the first die is a 3 there are six outcomes of interest, {(3, 1), (3, 2), . . . , (3, 6)}. Since the dice are fair, each of these outcomes has the same probability of occurring. Hence, given that the first die is a 3, the probability of the sum being 8 (that is, (3, 5) occurring) is 1/6.

    Slide 39
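A quick enumeration (added as a check of the example) restricts attention to the outcomes with first die equal to 3:

```python
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # 36 equally likely pairs
H = [o for o in outcomes if o[0] == 3]                          # first die is a 3
A_and_H = [o for o in H if o[0] + o[1] == 8]                    # ... and the sum is 8
print(Fraction(len(A_and_H), len(H)))                           # 1/6
```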

    If A denotes "sum of the dice is 8" and H denotes "first die is a 3", the probability we have calculated is called the conditional probability of A given H and is denoted P(A|H). Let us derive a general formula for P(A|H) valid for all events A and H.

    Slide 40

  • If H occurs then, in order for A to occur, it is necessary that the outcome be in both A and H, that is, it must be in A ∩ H. Also, since we know H has occurred, H may be regarded as our new sample space. Hence the probability of A given H is the probability of A ∩ H relative to the probability of H, that is

    P(A|H) = P(A ∩ H)/P(H)

    if P(H) > 0.

    In more advanced work, we also define P(A|H) even when P(H) = 0, but this is beyond the scope of this course.

    Slide 41

  • Multiplication Theorem (Ghahramani 3.2). Sometimes we know P(H) and P(A|H) but not P(A ∩ H). If this is the case we can use the definition of conditional probability to express the probability of A ∩ H, that is

    P(A ∩ H) = P(H) P(A|H).

    Slide 42

  • Examples: Toss a fair die. Let A = {2} and H = {x : x is even}. Toss two fair dice. Let A = {(i, j) : |i − j| ≤ 1} and H = {(i, j) : i + j = 7}.

    Slide 43

  • Let us assume that P(A|H) > P(A). That is, it is more likely that A will occur if we know that H has occurred than if we know nothing about H. Then we have

    P(H|A) = P(A ∩ H)/P(A)
           > P(A ∩ H)/P(A|H)
           = P(A ∩ H)/(P(A ∩ H)/P(H))
           = P(H).

    That is, if P(A|H) > P(A), then P(H|A) > P(H). A similar argument shows that if P(A|H) < P(A), then P(H|A) < P(H).

    Slide 44

  • We say that there exists a positive relationship between A and H if P(A|H) > P(A) and a negative relationship between A and H if P(A|H) < P(A). If there is a positive relationship between A and H, then the occurrence of H will increase the chance of A occurring. If there is a negative relationship between A and H, then the occurrence of H will decrease the chance of A occurring.

    Qu: What about the situation when P(A|H) = P(A)?

    Slide 45

  • Example. You are one of seven applicants for three jobs.

    Consider the random experiment that occurs when the decision is made to offer the jobs to three applicants. The outcome space consists of all combinations of three appointees from seven applicants. There are

    (7 choose 3) = (7 × 6 × 5)/(3 × 2 × 1) = 35

    of these.

    Slide 46

  • Another one of the applicants (Ms X) and you are the only applicants with a particular skill. You think that it is likely that the employer will want the skill and employ one of you, but unlikely that both of you will be employed. Specifically you make the assessments listed on the following slide about the likelihood of the various possible combinations.

    Slide 47

  • the five combinations where both you and Ms X get the job have probability 1/60,

    the ten combinations where you get the job and Ms X doesn't have probability 1/24,

    the ten combinations where Ms X gets the job and you don't have probability 1/24, and

    the ten combinations where neither of you gets the job have probability 1/120.

    Slide 48

  • You find out on the grapevine that Ms X has not got the job. How does your assessment of the probability that you will get the job change?

    Solution

    Let A be the event "you are selected" and H the event "Ms X is not selected".

    More correctly, A is the subset of Ω that consists of those outcomes in which you are selected and H is the subset of Ω that consists of those outcomes in which Ms X is not selected.

    Slide 49

  • Then

    P(A) = 5/60 + 10/24 = 1/2,

    P(H) = 10/120 + 10/24 = 1/2,

    and P(A ∩ H) = 10/24.

    Therefore

    P(A|H) = (10/24)/(1/2) = 5/6,

    and so if you know that Ms X did not get the job, your probability of getting the job is quite high.

    There is a positive relationship between the events A and H.

    Slide 50
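The arithmetic above can be checked with exact fractions (an added sketch):

```python
from fractions import Fraction as F

p_both     = 5  * F(1, 60)    # both you and Ms X selected
p_you_only = 10 * F(1, 24)    # you selected, Ms X not
p_msx_only = 10 * F(1, 24)    # Ms X selected, you not
p_neither  = 10 * F(1, 120)   # neither of you selected

assert p_both + p_you_only + p_msx_only + p_neither == 1   # the assessments sum to 1

P_A  = p_both + p_you_only      # A: you are selected
P_H  = p_you_only + p_neither   # H: Ms X is not selected
P_AH = p_you_only               # A and H: you are selected and Ms X is not
print(P_A, P_H, P_AH, P_AH / P_H)   # 1/2 1/2 5/12 5/6
```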

  • Independence of Events (Ghahramani 3.5). If P(A|B) > P(A) then P(B|A) > P(B). A and B tend to occur together and we call them positively related.

    If P(A|B) < P(A) then P(B|A) < P(B). A and B tend to occur separately and we call them negatively related.

    If P(A|B) = P(A) then P(B|A) = P(B). A and B don't appear to influence each other and we call them independent.

    Slide 51

  • As we have just seen, P(A|B) = P(A) and P(B|A) = P(B) are algebraically equivalent to

    P(A ∩ B) = P(A)P(B). This equation is taken as the mathematical definition of the independence of two events. It is a special case of the general multiplication theorem

    P(A ∩ B) = P(A)P(B|A). Two events that are not independent are said to be dependent.

    Slide 52

  • If A and B are independent events then so are

    Aᶜ and B, A and Bᶜ,

    Aᶜ and Bᶜ.

    Slide 53

  • Independence of n > 2 events (Ghahramani 3.5). Now let's think about extending the idea of independence to more than two events. We talk of the mutual independence of n > 2 events. But can we cover all dependencies by checking for pairwise independence of all possible pairs? Consider the random experiment of tossing two fair coins and the following three events:

    A: first coin is H. B: second coin is H. C: exactly one coin is H.

    Slide 54

  • Events A1, A2, . . . , An are said to be independent if for any subcollection {j1, j2, . . . , jm} ⊆ {1, 2, . . . , n}

    P(Aj1 ∩ Aj2 ∩ . . . ∩ Ajm) = P(Aj1)P(Aj2) . . . P(Ajm).

    Slide 55

  • If the events A1, A2, . . . , An are independent then the following derived collections of events are also independent:

    A1ᶜ, A2, . . . , An;  A1ᶜ, A2ᶜ, A3, . . . , An;  A1 ∪ A2, A3, . . . , An;  A1 ∩ A2, A3, . . . , An.

    Slide 56

  • An important and frequently-applied consequence of independence:

    If the events A1, A2, . . . , An are independent then

    P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1)P(A2) . . . P(An).

    NB: The converse is not true. Why?

    This result is particularly useful for analysing n independent repetitions of a random experiment.

    Slide 57

  • Independence vs Exclusion. Independence of events A and B is a different concept from A and B being mutually exclusive or disjoint. You can test for mutual exclusion simply by inspecting the outcomes in A and B to see if there are any in common, even before any probability function is defined.

    But you cannot test independence without knowing the probabilities.

    Slide 58

  • Unless one or both have probability zero, disjoint events A and B cannot be independent, since

    P(A ∩ B) = P(∅) = 0 < P(A)P(B).

    In fact, the events A and B are negatively related: the occurrence of A excludes the occurrence of B.

    Slide 59

  • Network reliability. We can calculate the reliability of a network of independent components connected in series and parallel.

    For example, imagine a network of interconnected switches which should close in an emergency to set off an alarm. Individual switches fail at random and independently. For the alarm to sound there must be at least one path for current to flow from left to right.

    Slide 60
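No specific network diagram survives in this transcript, so the sketch below (an addition) just illustrates the two rules for independent components, with made-up reliabilities.

```python
def series(*ps):
    """All components must work: multiply the reliabilities (independence)."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def parallel(*ps):
    """At least one component must work: 1 minus the product of failure probabilities."""
    out = 1.0
    for p in ps:
        out *= (1 - p)
    return 1 - out

# Hypothetical example: two switches in series, in parallel with a third switch.
print(parallel(series(0.9, 0.8), 0.95))   # 1 - (1 - 0.72)(1 - 0.95) = 0.986
```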

  • Law of Total Probability (Ghahramani 3.3). A partition of the outcome space is a collection of disjoint and exhaustive events (A1, A2, . . .). That is, Ai ∩ Aj = ∅ for all i ≠ j, and

    ⋃_i Ai = Ω.

    The simplest partition is of the form (A, Aᶜ).

    Slide 61

  • Now, for any event H,

    H = H ∩ Ω = H ∩ (⋃_i Ai) = ⋃_i (H ∩ Ai),

    where the last equation follows from the distributive law.

    Slide 62

  • Using probability axiom 3 and the multiplication formula, we have

    P(H) = Σ_i P(H ∩ Ai) = Σ_i P(H|Ai)P(Ai).

    Slide 63

  • This brings us to the Law of Total Probability.

    If A1, A2, . . . are disjoint and exhaustive events then, for any event H,

    P(H) = Σ_i P(H|Ai)P(Ai).

    Slide 64

  • The Law of Total Probability is one of the most important equations in applied probability. By choosing the right partition, it can be used (in different forms, of course) to set up equations concerning financial markets, queues, teletraffic networks, epidemics, genetics and computer systems, to name just a few applications.

    Slide 65

  • Bayes' Formula (Ghahramani 3.4). Sometimes we want to write P(H|A) in terms of P(A|H). This is easily done from the definition of P(H|A). We have

    P(H|A) = P(A ∩ H)/P(A)
           = [P(A ∩ H)/P(H)] × [P(H)/P(A)]
           = P(A|H)P(H)/P(A),

    which is known as Bayes' Formula.

    Slide 66

  • Use of Bayes' Formula can yield some counter-intuitive facts.

    Example: Suppose a test for HIV is 90% effective in the sense that if a person is HIV positive then the test has a 90% chance of saying that they are HIV positive. If they are not positive, assume there is still a 5% chance that the test says that they are.

    Let A be the event that the test says that a person is HIV positive, and H be the event that the person actually is HIV positive. Then we have

    P(A|H) = 0.9, P(A|Hᶜ) = 0.05.

    Slide 67

  • Now suppose a person tests positive for HIV. What is the probability they actually are HIV positive?

    From Bayes' Formula:

    P(H|A) = P(A|H)P(H)/P(A) = 0.9 × P(H)/P(A).

    Slide 68

  • Now, since not many of the population are HIV positive, P(H) is likely to be small, say 0.0001.

    We can calculate P(A) using the Law of Total Probability:

    P(A) = P(A|H)P(H) + P(A|Hᶜ)P(Hᶜ) = 0.9 × 0.0001 + 0.05 × 0.9999 = 0.050085,

    and so P(H|A) = (0.9 × 0.0001)/0.050085 ≈ 0.0018.

    Slide 69

  • Under these numbers a person is unlikely to be HIV positive even if the test says that they are. Such phenomena are well known in the epidemiological literature. Tests for rare diseases have to be very accurate.

    Slide 70
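The posterior probability above can be reproduced in a couple of lines (an added check):

```python
p_H = 0.0001                # prevalence: P(H)
p_pos_given_H = 0.9         # P(A | H), test positive given actually positive
p_pos_given_not_H = 0.05    # P(A | H complement), false positive rate

p_A = p_pos_given_H * p_H + p_pos_given_not_H * (1 - p_H)   # Law of Total Probability
p_H_given_A = p_pos_given_H * p_H / p_A                     # Bayes' Formula
print(round(p_A, 6), round(p_H_given_A, 4))                 # 0.050085 0.0018
```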

  • A version of Bayes' Formula that is often used comes by combining it with the Theorem of Total Probability. Thus we have

    Bayes' Formula (2)

    Let A1, A2, . . . be a set of disjoint and exhaustive events. Then for an event H,

    P(Ai|H) = P(H|Ai)P(Ai) / Σ_j P(H|Aj)P(Aj).

    Slide 71

  • Proof. From Bayes' Formula

    P(Ai|H) = P(H|Ai)P(Ai)/P(H).

    Substitution of P(H) = Σ_j P(H|Aj)P(Aj)

    from the Law of Total Probability gives the result.

    Slide 72

  • Example (Multiple Choice Exams). Consider a multiple choice exam that has m choices of answer for each question. Assume that the probability that a student knows the correct answer to a question is p. A student that doesn't know the correct answer marks an answer at random. Suppose that the answer marked to a particular question was correct. What is the probability that the student was guessing?

    Slide 73

  • Solution. Here we have the disjoint and exhaustive events

    B1: the student knew the correct answer; B2: the student was guessing;

    and the observed event A: the correct answer was marked.

    Slide 74

  • The conditional probabilities are

    P(A|B1) = 1, P(A|B2) = 1/m. We want to find

    P(B2|A) = P(A|B2)P(B2) / [P(A|B1)P(B1) + P(A|B2)P(B2)]
            = (1/m)(1 − p) / [p + (1/m)(1 − p)]
            = (1 − p)/(mp + 1 − p).

    Slide 75
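A small numerical check of the final formula (an addition; the values m = 4 and p = 0.6 are illustrative, not from the notes):

```python
def prob_guessing_given_correct(m, p):
    # Bayes: P(B2|A) = P(A|B2)P(B2) / (P(A|B1)P(B1) + P(A|B2)P(B2))
    return (1 / m) * (1 - p) / (p + (1 / m) * (1 - p))

m, p = 4, 0.6                               # assumed illustrative values
print(prob_guessing_given_correct(m, p))    # 0.142857...
print((1 - p) / (m * p + 1 - p))            # same value via the simplified formula
```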

  • Random Variables (Ghahramani 4.1). In many random experiments we are interested in some function of the outcome rather than the actual outcome itself.

    For instance, in tossing two dice (as in Monopoly) we may be interested in the sum of the two dice (e.g. 7) and not in the actual outcome (e.g. (1,6), (2,5), (3,4), (4,3), (5,2) or (6,1)). Also, if we had a film of a horse race, then we would be able to determine the winner. We can think of the winner as a function of the film ω.

    In these cases we wish to assign a real number x to each outcome ω in the sample space Ω. That is,

    x = X(ω)

    is the value of a function X from Ω to the real numbers R.

    Slide 76

  • Definition. Consider a random experiment with sample space Ω. A function X which assigns to every outcome ω ∈ Ω a real number X(ω) is called a random variable.

    NB: In more advanced courses, there are some restrictions on the function X, but we won't worry about them here.

    Slide 77

  • The terminology "random variable" is unfortunate because X is neither random nor a variable. However it is universally accepted.

    It is standard to denote random variables by capital letters X, Y etc. and the values that they take by lower case letters x, y etc.

    We shall denote the set of possible values of X by SX ⊆ R (this differs from the notation in Ghahramani).

    Slide 78

  • Example. Suppose we toss two coins. The sample space is

    Ω = {(H,H), (H,T), (T,H), (T,T)}. Let X(ω), ω ∈ Ω, be the number of heads in ω. Then

    X((H,H)) = 2

    X((H,T )) = X((T,H)) = 1

    X((T, T )) = 0.

    Here SX = {0, 1, 2}.

    Slide 79

  • X is not necessarily a 1-1 function. Different values of ω may lead to the same value of X(ω). For example, X((H,T)) = X((T,H)).

    The sets
    A2 = {ω : X(ω) = 2} = {(H,H)}
    A1 = {ω : X(ω) = 1} = {(T,H), (H,T)}
    A0 = {ω : X(ω) = 0} = {(T,T)}

    are subsets of Ω and hence are events of the random experiment.

    Slide 80

  • So we can see that a probability function defined on the events of the experiment indirectly leads to a distribution of probabilities across the possible values of the random variable. We formalise this as follows.

    Definition: Consider a random experiment with sample space Ω. Let X be a random variable defined on Ω. Then, for x ∈ SX, the probability that X is equal to x, denoted P(X = x), is the probability of the event Ax ≡ {ω : X(ω) = x}. Thus

    P(X = x) = P(Ax).

    Slide 81

  • Consequently we can think of statements involving random variables as a form of shorthand. For example,

    X = x for {ω : X(ω) = x};  X ≤ x for {ω : X(ω) ≤ x};  x < X ≤ y for {ω : x < X(ω) ≤ y}.

    This shorthand reflects a shift in our interest from the random experiment (Ω, P) as a whole towards the distribution of the random variable of interest (X, P(X = x)).

    Slide 82

  • Example. Toss two dice. The sample space is

    Ω = {(1, 1), . . . , (6, 6)}. Let X denote the random variable whose value is the sum of the two faces. Assuming each outcome in Ω is equally likely,

    P(X = 2) = P({ω : X(ω) = 2}) = P({(1, 1)}) = 1/36,
    P(X = 3) = P({ω : X(ω) = 3}) = P({(1, 2), (2, 1)}) = 2/36,
    P(X = 4) = P({ω : X(ω) = 4}) = P({(1, 3), (2, 2), (3, 1)}) = 3/36,
    and so on. Of course other random variables may be of interest, for example the minimum or maximum number showing.

    Slide 83

  • Definition. A set is said to be countable if it is either finite or can be put into a 1-1 correspondence with the set of natural numbers {1, 2, 3, . . .}. That is, a set is countable if it is possible to list its elements in the form x1, x2, . . .. Otherwise a set is uncountable.

    Slide 84

  • N = {1, 2, . . .}, Z = {0, 1, −1, 2, −2, . . .} and

    Z × Z = {(0, 0), (0, 1), (1, 0), (0, −1), (−1, 0), . . .} are all countable sets.

    It is known, via a very elegant proof, that [0, 1] and R are uncountable.

    Slide 85

  • Discrete Random Variables (Ghahramani 4.3, 4.2)

    For a discrete random variable, the set of possible values SX is countable. That is, X can take only a countable number of values.

    Slide 86

  • Definition. Let X be a discrete random variable. The probability mass function (pmf) pX(x) of X is the function from SX to [0, 1] defined by

    pX(x) = P(X = x).

    You can think of the pX(x) as discrete masses of probability assigned to each possible x ∈ SX. In the above, x is a dummy variable: we could use t or any other symbol. However, it is common to use x as a reminder that X is the random variable, and if it is clear that the pmf of X is intended, the subscript X may then be omitted.

    Slide 87

  • We talk about the probability mass function (pmf) determining the probability distribution (or just distribution for short) of the discrete random variable X.

    Note: in Ghahramani, the pmf is defined on the domain R, but as pX(x) = 0 for all x ∉ SX we prefer to restrict the domain to SX.

    Slide 88

  • Example. Let X be the sum of the numbers shown on the toss of two fair dice. Then SX = {2, . . . , 12} and pX(x) is given by

    pX(2) = P(X = 2) = 1/36    pX(8)  = 5/36
    pX(3) = P(X = 3) = 2/36    pX(9)  = 4/36
    pX(4) = 3/36               pX(10) = 3/36
    pX(5) = 4/36               pX(11) = 2/36
    pX(6) = 5/36               pX(12) = 1/36
    pX(7) = 6/36

    Slide 89
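The table can be generated by enumeration (an added sketch):

```python
from collections import Counter
from fractions import Fraction

counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))   # sums of two fair dice
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
for x, p in pmf.items():
    print(x, p)     # fractions print in lowest terms, e.g. 3 -> 1/18 rather than 2/36
```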

  • Theorem: The probability mass function pX(x) of a discrete random variable X satisfies the following:

    1. pX(x) ≥ 0 for all x.

    2. Σ_{x ∈ SX} pX(x) = 1.

    Indeed any function satisfying (1) and (2) can be thought of as the pmf for some random variable.

    Slide 90

  • Proof. Part (1) is obvious as

    p(x) = P(X = x) = P({ω : X(ω) = x})

    and 0 ≤ P(A) ≤ 1 for all events A.

    Slide 91

  • For (2) first note that for x1 ≠ x2, the events

    {ω : X(ω) = x1} and {ω : X(ω) = x2}

    are disjoint. So

    P(X = x1 or x2) = P({ω : X(ω) = x1 or x2})
                    = P({ω : X(ω) = x1} ∪ {ω : X(ω) = x2})
                    = P({ω : X(ω) = x1}) + P({ω : X(ω) = x2})
                    = P(X = x1) + P(X = x2).

    Slide 92

  • As SX is the set of possible values of X(ω) for ω ∈ Ω, it follows that P({ω : X(ω) ∈ SX}) = P(Ω) = 1,

    but also

    P({ω : X(ω) ∈ SX}) = Σ_{x ∈ SX} P({ω : X(ω) = x}) = Σ_{x ∈ SX} P(X = x).

    Hence Σ_{x ∈ SX} P(X = x) = 1.

    Slide 93

  • From the proof we can see that, for any set B ⊆ R, given the pmf, we can compute the probability that X ∈ B via

    P(X ∈ B) = Σ_{x ∈ B ∩ SX} pX(x).

    In particular, P(X ≤ x) = Σ_{y ≤ x} pX(y).

    Slide 94

  • Example. Suppose that the discrete random variable X has pmf given by:

    x        1    2    3    4    5
    pX(x)    α   2α   3α   4α   5α

    Σ pX(x) = 1 gives α + 2α + 3α + 4α + 5α = 15α = 1, so α = 1/15.

    P(2 ≤ X ≤ 4) = 2/15 + 3/15 + 4/15 = 9/15.

    Slide 95

  • Distribution function (Ghahramani 4.2). Definition: Let X be a discrete random variable. The distribution function FX(x) of X is the function from R to [0, 1] defined by

    FX(x) = P(X ≤ x). Particularly in the statistical literature, the distribution function is sometimes referred to as the cumulative distribution function (cdf).

    Slide 96

  • Properties of the distribution function (Ghahramani 4.2)

    1. 0 ≤ FX(x) ≤ 1, since it is a probability.

    2. FX(−∞) = 0, FX(∞) = 1, from the definition of the distribution function.

    3. P(a < X ≤ b) = FX(b) − FX(a), if a < b, since {X ≤ a} ∪ {a < X ≤ b} = {X ≤ b}

    and the events on the LHS are mutually exclusive.

    4. FX(x) is non-decreasing. This follows from Property 3, since if b > a, then FX(b) − FX(a) = P(a < X ≤ b) ≥ 0.

    Slide 97

  • 5. FX(·) is continuous on the right, that is, lim_{h↓0} FX(x + h) = FX(x). Using Property 3, if h > 0, then

    FX(x + h) − FX(x) = P(x < X ≤ x + h). As h ↓ 0, {x < X ≤ x + h} ↓ ∅ and so the probability on the right hand side approaches zero.

    6. P(X = x) is the jump in FX at x. That is, P(X = x) = FX(x) − lim_{h↓0} FX(x − h). Again, if h > 0, then

    FX(x) − FX(x − h) = P(x − h < X ≤ x). As h ↓ 0, {x − h < X ≤ x} ↓ {X = x}.

    Slide 98

  • Continuous random variables (Ghahramani 6.1, 4.2). If SX, the set of possible values of X, is uncountable we call X a continuous random variable. It is not possible to assign probability masses directly to every possible value of a continuous random variable.

    We deal with this by assigning probabilities to intervals.

    In fact, the definition of the distribution function is exactly the same as it wasfor a discrete random variable.

    Slide 99

  • Distribution function. Definition:

    Let X be a continuous random variable. The distribution function FX(x) of X is the function from R to [0, 1] defined by

    FX(x) = P(X ≤ x).

    Slide 100

  • Probability density function. Let X be a continuous random variable. A function fX(x) which is such that, for all x ∈ R,

    ∫_{−∞}^{x} fX(y) dy = FX(x)

    is called a probability density function (pdf) of X. If such a function exists, then it is unique. In this case, there is no value of x ∈ SX to which it is possible to assign a probability mass.

    If a pdf exists then, for almost all values of x, we also have

    dFX(x)/dx = fX(x).

    Slide 101

  • Properties of the pdf

    1. fX(x) ≥ 0 since FX(x) is non-decreasing.

    2. ∫_a^b fX(t) dt = FX(b) − FX(a) = P(a < X ≤ b), that is, probability is represented by the area under the graph of fX(x). For a random variable that has a density function

    P(a < X ≤ b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = P(a < X < b)

    since the end points have zero probability.

    3. ∫_{−∞}^{∞} fX(t) dt = 1 since FX(∞) = 1 and FX(−∞) = 0.

    Note that properties 1 and 3 are sufficient for fX to be a pdf.

    Slide 102

  • Discrete vs Continuous With No Masses

    Discrete                                  Continuous
    pmf pX(x)                                 pdf fX(x)
    prob. masses pX(x) at x                   no masses: P(X = x) = 0 for all x
    Σ_{x ∈ SX} pX(x) = 1                      ∫_{−∞}^{∞} fX(t) dt = 1
    P(X ∈ I) = Σ_{x ∈ I} pX(x)                P(X ∈ I) = ∫_a^b fX(t) dt
    0 ≤ pX(x) ≤ 1                             0 ≤ fX(x)

    where I = [a, b].

    Slide 103

  • There is no need to have fX(x) ≤ 1 since areas, not the value of the density function, represent probabilities. Thus, for example

    fX(x) = 10^6 for 0 ≤ x ≤ 10^(−6), and 0 otherwise,

    is a pdf since fX(x) ≥ 0 and ∫ fX(x) dx = 1.

    Slide 104

  • Pdf interpretation. Whilst the value of the pdf fX(x) is not the probability at x, it can be interpreted as a probability density around x.

    Define {X ≈ x} to mean {x − δx/2 < X ≤ x + δx/2}. Then

    P(X ≈ x) = P(x − δx/2 < X ≤ x + δx/2)
             = FX(x + δx/2) − FX(x − δx/2)
             = ∫_{x − δx/2}^{x + δx/2} fX(u) du
             ≈ fX(x) δx.

    So almost everywhere we have P(X ≈ x) ≈ fX(x) δx.

    Slide 105

  • It follows that fX(x) gives a measure of the relative probability of an observed value near x, in the sense that

    P(X ≈ x1)/P(X ≈ x2) ≈ fX(x1)/fX(x2).

    Slide 106

  • Unlike the pmf and pdf, the distribution function can be used equally well to describe both discrete and continuous random variables. In fact it also describes random variables that are essentially continuous, but have masses at some points.

    Example: Assume you walk into a bank. Let X be the time that you have to wait for a teller. If there is no queue, which happens with some positive probability, then X = 0. Otherwise X is a continuous random variable.

    Slide 107

  • Distribution functions can clearly take many forms: for a discrete random variable, the distribution function is a step function. For a random variable that has no point masses of probability, the distribution function is continuous. If the random variable has some (at most a countable number of) point masses then the distribution function is a function with a discontinuity corresponding to each jump.

    Slide 108

  • The Story So Far. It is important to make sure that you are fully aware of the subtle distinctions between probability measures, random variables, probability mass and density functions and cumulative distribution functions. Now that we have seen them all, we will take a moment to review them and point out the differences. Consider a random experiment with sample space Ω.

    Slide 109

  • Then

    1. The probability measure P maps the set of events (that is, subsets of Ω) to [0, 1].

    2. A random variable X maps Ω to R.

    3. For a discrete random variable, the probability mass function maps the set of possible values SX to [0, 1].

    4. The distribution function maps R to [0, 1].

    5. For a continuous random variable with no point masses, the probability density function maps R to [0, ∞).

    Slide 110

  • We often talk about random variables and their probability mass and distribution functions without explicit reference to the underlying sample space. For example, in talking about an experiment in which a coin is tossed n times we may define the random variable X to be the number of heads that turns up, and then go on to talk about the probability mass and distribution functions of X (which are?).

    This is an example of a shorthand expression which mathematicians often use. However they only use it when they fully understand the situation. In a case such as that described above, it is understood that the underlying sample space is the set of sequences of H and T of length n, without this fact having to be mentioned explicitly.

    Slide 111

  • Expectation (Ghahramani 4.4). The distribution function contains all the information about the distribution of a random variable. However, this information can be difficult to digest. Because of this, we often summarise the information by reducing it in some way.

    The most common such measure is the expected value.

    Slide 112

  • The concept of expectation first arose in gambling problems: Is a particular game a good investment? Consider the game where the winnings $W has pmf

    w           −1     1     10
    P(W = w)   0.75  0.20  0.05

    Is it worthwhile? If you played the game 1000 times, you would expect to lose $1 about 750 times, to win $1 about 200 times and to win $10 about 50 times. Thus you will win about

    $ (−1 × 750 + 1 × 200 + 10 × 50)/1000 per game.

    Slide 113

  • Your expected winnings are −5 cents per game. We say that the expected value of W is −0.05.

    This gives an indication of the worth of the game: in the long run, you can expect to lose an average of about 5 cents per game.

    Slide 114
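The exact expected value and a simulation of many plays (an added sketch) both come out at about −5 cents per game:

```python
import random

values = [-1, 1, 10]
probs  = [0.75, 0.20, 0.05]

exact = sum(w * p for w, p in zip(values, probs))
plays = random.choices(values, probs, k=100_000)        # simulate 100,000 games
print(exact, round(sum(plays) / len(plays), 3))         # -0.05 and something close to it
```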

  • Expectations of Discrete RVs (Ghahramani 4.4). Let X be a discrete random variable with possible values in the set SX, and probability mass function pX(x).

    The expected value or mean of X, denoted by E[X], is defined by

    E[X] = Σ_{x ∈ SX} x pX(x),

    provided the sum on the right hand side converges absolutely.

    Note: Ghahramani uses A rather than SX. We prefer the latter as A is often used to denote events.

    Slide 115

  • Example. Find E[X] if X is the value of the upturned face after a toss of a fair die.

    Solution: SX = {1, . . . , 6} and pX(1) = pX(2) = . . . = pX(6) = 1/6.

    Hence

    E[X] = 1 × (1/6) + 2 × (1/6) + 3 × (1/6) + . . . + 6 × (1/6) = 7/2.

    Slide 116

  • E[X] is not necessarily a possible value of X. It isn't in the example. We can never get 7/2 to show on the face of a die. However, if we toss a die n times and let xi denote the result of the ith toss, then we would expect that

    lim_{n→∞} (1/n) Σ_{i=1}^{n} xi = E[X].

    Thus, after a large number of tosses we expect the average of all the values of X to be close to E[X].

    Slide 117

  • More generally, suppose any random experiment is repeated a large number of times, and the random variable X observed each time; then the average of the observed values should approximately equal E[X].

    Another way to think of the expected value of a random variable is as the location of the centre of mass of its probability distribution.

    We often denote the expected value by μ or μX to be clear which random variable is involved.

    Slide 118

  • Example. A manufacturer produces items of which 10% are defective and 90% are non-defective. If a defective item is produced the manufacturer loses $1, while a non-defective item yields a profit of $5. If X is the profit on a single item, then X is a random variable whose expected value is

    E[X] = −1(0.1) + 5(0.9) = $4.40. For any given item the manufacturer will either lose $1 or make $5. The interpretation of E[X] is that if the manufacturer makes a lot of items he or she can expect to make an average $4.40 per item.

    Slide 119

  • Expectations of Continuous RVs (Ghahramani 6.3). The definition of the expected value of a continuous random variable is analogous to that for a discrete random variable.

    Let X be a continuous random variable with possible values in the set SX, and probability density function fX(x).

    The expected value or mean of X, denoted by E[X], is defined by

    E[X] = ∫_{SX} x fX(x) dx,

    provided the integral on the right hand side converges absolutely.

    Slide 120

  • The connection with the definition of the expected value of a discrete random variable can be seen by approximating the integral with a Riemann sum. Divide SX up into n intervals of length δx. Then

    E[X] = ∫_{SX} x fX(x) dx
         ≈ Σ_{i=1}^{n} xi fX(xi) δx
         ≈ Σ_{i=1}^{n} xi P(xi ≤ X < xi + δx).

    Slide 121

  • Example. If X has pdf fX(x) = 12x²(1 − x) (0 < x < 1), then

    E[X] = ∫_0^1 12x³(1 − x) dx
         = 12 [x⁴/4 − x⁵/5]_0^1
         = 12 × 1/20
         = 3/5.

    Slide 122
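The integral can be checked numerically; the sketch below (an addition) uses a plain midpoint Riemann sum rather than any special library.

```python
def expected_value(pdf, a, b, n=100_000):
    """Approximate E[X] = integral of x*pdf(x) dx over [a, b] by a midpoint Riemann sum."""
    h = (b - a) / n
    return sum((a + (i + 0.5) * h) * pdf(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 12 * x**2 * (1 - x)              # the pdf from the example, on (0, 1)
print(round(expected_value(f, 0.0, 1.0), 4))   # 0.6 = 3/5
```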

  • Expectation of functions (Ghahramani 4.4, 6.3). In many situations we are interested in calculating the expected value of a function ψ(X) of a random variable X. This is easy to do.

    Theorem

    If X is a discrete random variable with set of possible values SX and probability mass function pX(x), then, for any real-valued function ψ,

    E[ψ(X)] = Σ_{x ∈ SX} ψ(x) pX(x),

    provided the sum converges absolutely.

    Slide 123

  • Example. Let's return to the toss of a fair die with X the number on the upturned face:

    E[X²] = 1² × (1/6) + 2² × (1/6) + . . . + 6² × (1/6) = 91/6.

    Note that E[X²] = 91/6 but E[X]² = (7/2)² = 49/4.

    Slide 124

  • Theorem

    If X is a continuous random variable with set of possible values SX and probability density function fX(x), then, for any real-valued function ψ,

    E[ψ(X)] = ∫_{SX} ψ(x) fX(x) dx,

    provided the integral converges absolutely.

    Slide 125

  • Example. If X has pdf fX(x) = 2x (0 < x < 1), then

    E[X] = ∫_0^1 x × 2x dx = 2/3,

    E[1/X] = ∫_0^1 (1/x) × 2x dx = 2.

    Note that

    E[1/X] ≠ 1/E[X] = 3/2.

    Slide 126

  • Generally, E[ψ(X)] ≠ ψ(E[X]), with one important exception: when ψ is a linear function.

    Theorem: If X is a random variable and a and b are constants, then

    E[aX + b] = aE[X] + b.

    Proof: We shall do the discrete case. The continuous case is similar.

    E[aX + b] = Σ_{x ∈ SX} (ax + b) pX(x)
              = Σ_{x ∈ SX} ax pX(x) + Σ_{x ∈ SX} b pX(x)
              = a Σ_{x ∈ SX} x pX(x) + b Σ_{x ∈ SX} pX(x)
              = aE[X] + b.

    Slide 127

  • Variance (Ghahramani 4.5, 6.3). One particular function of a random variable gives rise to a measure of the spread of the random variable.

    Definition: The variance V(X) or Var(X) of a random variable X is defined by

    V(X) = E[(X − E[X])²]. V(X) measures the consistency of outcome: a small value of V(X) implies that X is more often near E[X], whereas a large value of V(X) means that X varies around E[X] quite a lot.

    Slide 128

  • Example. Consider the batting performance of two cricketers, one of whom hits a century (exactly 100) with probability 1/2 or gets a duck (0) with probability 1/2. The other scores 50 every time.

    Let X1 be the random variable giving the number of runs scored by the first batsman and X2 the number of runs scored by the second batsman. Then

    E[X1] = (1/2) × 0 + (1/2) × 100 = 50,

    E[X2] = 1 × 50 = 50.

    Slide 129

  • However,

    V(X1) = (1/2)(0 − 50)² + (1/2)(100 − 50)²
          = (1/2)(2500) + (1/2)(2500)
          = 2500,

    V(X2) = 1 × (50 − 50)² = 0,

    which reflects the fact that the second batsman is more consistent.

    Slide 130

  • From the definition of the variance we can see that the more widespread the likely values of X, the larger the likely values of (X − μ)² and hence the larger the value of V(X).

    This is why the variance is a measure of spread. We often denote the variance by σ², or σX² to be clear which random variable is involved.

    The square root of V(X) is called the standard deviation and is denoted by σX, sd(X) or just σ if the random variable involved is clear. As the units of the standard deviation and the random variable are the same, spread is often measured in standard deviation units.

    Slide 131

  • There are alternative measures of spread.

    For example, the mean deviation, d = E(|X − μ|). However for various mathematical reasons the variance (and standard deviation) are by far the most frequently used. One fundamental reason is that, when X and Y are independent, the variance is additive.

    In fact additivity holds for a broader class of random variables, as we shall see when we study covariance and correlation.

    Slide 132

  • Notes on variance

    1. V(X) ≥ 0 since (X − μ)² ≥ 0.
    2. V(X) = 0 ⟺ P(X = μ) = 1.
    3. If Y = aX + b, then V(Y) = a²V(X) and sd(Y) = |a| sd(X).
    4. If X has mean μ and variance σ², Xs = (X − μ)/σ has mean 0 and variance 1. Xs is called a standardised random variable.

    5. The mean and variance do not entirely determine the distribution; they just give some idea of location and spread.

    Slide 133

  • Theorem: V(X) = E[X²] − E[X]².

    Proof:

    V(X) = E[(X − E[X])²]
         = E[X² − 2XE[X] + E[X]²]
         = E[X²] − 2E[XE[X]] + E[E[X]²]
         = E[X²] − 2E[X]E[X] + E[X]²
         = E[X²] − 2E[X]² + E[X]²
         = E[X²] − E[X]².

    Slide 134

  • Example. Calculate V(X) where X represents the roll of a fair die.

    Solution: We saw before that

    E[X] = 7/2 and E[X²] = 91/6,

    and so V(X) = 91/6 − 49/4 = 35/12.

    Slide 135

  • Higher moments of a random variable (Ghahramani 4.5, 11.1). We will consider these in more detail later in the course but simply note these definitions at this point:

    The kth moment (about the origin) of a random variable X is given by μ′_k = E(X^k).

    The kth central moment (about the mean) of a random variable X is given by μ_k = E((X − μ)^k).

    So the mean E(X) is the first moment of X and the variance V(X) is the second central moment of X.

    Slide 136

  • Special Probability Distributions (Ghahramani, Ch. 5 & 7). Certain classes of random experiment and random variables defined upon them turn up so often that we define and name standard functions as their distribution functions, pmfs or pdfs as appropriate.

    We shall now discuss the most common examples of such random variables.

    Slide 137

  • Discrete random variables. Some discrete distributions which arise frequently in modelling real world phenomena are:

    Bernoulli, Binomial, Geometric, Negative Binomial, Hypergeometric, Poisson, Uniform.

    Slide 138

  • Continuous random variables. Some continuous distributions which arise frequently in modelling real world phenomena are:

    Uniform, Normal, Exponential, Gamma.

    Slide 139

  • Bernoulli Random Variables (Ghahramani 5.1). The most basic family of random variables consists of those X that can take on just the two values 0 and 1.

    These values might represent any aspect of a random experiment that is a dichotomy, such as {success, failure}, {right, wrong}, {true, false} or {head, tail}. A random experiment in which such a dichotomy is observed is called a Bernoulli trial.

    Slide 140

  • The random variable X is known as a Bernoulli random variable. If p is the probability of a success, then the probability mass function is

    p(0) = 1 − p, p(1) = p.

    The value p ∈ [0, 1] is a parameter. By varying it, we get different members of the family of Bernoulli random variables.

    We say X has a Bernoulli distribution with parameter p.

    Slide 141

  • Bernoulli mean and variance. Applying our formulae for expectations we have

    E(X) = 0 × (1 − p) + 1 × p = p.
    E(X²) = 0² × (1 − p) + 1² × p = p.
    V(X) = E(X²) − E(X)² = p − p² = p(1 − p).

    Slide 142

  • Many random experiments take the form of sequences of Bernoulli trials. Moreover the physical processes that govern the experiments are such that we judge that it is reasonable to assume that the outcomes of the Bernoulli trials are independent.

    Thus, for example, the Bernoulli trials might be tosses of separate coins, or the same coin at different time points.

    The Bernoulli, Binomial, Geometric and Negative Binomial random variables all arise in the context of a sequence of independent Bernoulli trials. They each summarise different aspects of the observed sequence of successes and failures.

    Slide 143

  • Binomial random variables (Ghahramani 5.1). Consider a sequence of n independent Bernoulli trials with p = P(success). The sample space for such an experiment could be taken to be the set of all sequences of the form

    ω = S S F F S S F . . . S   (n letters)

    and the probability of any given sequence occurring is

    P({ω}) = p^(no. of successes) × (1 − p)^(no. of failures).   (*)

    Slide 144

  • However we usually aren't interested in the precise sequence that comes up. More interesting is the total number of successes.

    Define the random variable N(ω) = no. of successes in ω.

    To find the pmf pN of N, we need P(N = k), which is the probability of the event

    Ak = {ω : N(ω) = k}, k = 0, 1, . . . , n. From (*) above we know that the probability of any given ω ∈ Ak is

    p^k (1 − p)^(n−k).

    Thus to get P(Ak) all we need to do is count how many ω there are in Ak.

    Slide 145

  • This is known to be

    (n choose k) = n!/(k!(n − k)!).

    So we have

    pN(k) = P(N = k) = P(Ak) = (n choose k) p^k (1 − p)^(n−k).

    Slide 146

  • Binomial distribution. If X = no. of successes in n independent Bernoulli trials with p = P(success), then

    pX(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x), x = 0, 1, 2, . . . , n,

    and we say X has a Binomial distribution with parameters n and p and write X d= Bi(n, p).

    Note that we can then write X d= Bi(1, p) for a Bernoulli distribution.

    Slide 147

  • Using the Binomial Theorem we can verify that

    Σ_{x=0}^{n} pX(x) = Σ_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) = (p + 1 − p)^n = 1^n = 1.

    Note: The Binomial Theorem is very important in many areas of mathematics and is certainly something you should know. It states that for integer n

    (a + b)^n = Σ_{k=0}^{n} (n choose k) a^k b^(n−k).

    Slide 148

  • Binomial mean and variance. Applying our formulae for expectations we can deduce that

    E(X) = np.

    E(X(X − 1)) = n(n − 1)p².
    V(X) = E(X(X − 1)) + E(X) − E(X)²
         = np(1 − p).

    Slide 149

  • Example: Tay-Sachs disease. This is a hereditary metabolic disorder caused by a recessive genetic trait. When both parents are carriers their child has probability 1/4 of being born with the disease (Mendel's first law).

    If such parents have four children, what is the probability distribution for X = no. of children born with the disease?

    Slide 150
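With X d= Bi(4, 1/4), the pmf can be tabulated directly (an added sketch):

```python
from fractions import Fraction
from math import comb

n, p = 4, Fraction(1, 4)
for x in range(n + 1):
    px = comb(n, x) * p**x * (1 - p)**(n - x)   # Bi(4, 1/4) probabilities
    print(x, px)    # 81/256, 27/64, 27/128, 3/64, 1/256 (printed in lowest terms)
```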

  • Binomial distribution shape. We can show that the ratio of successive binomial probabilities r(x) satisfies

    r(x) = pX(x)/pX(x − 1) = ((n + 1)/x − 1)/(1/p − 1),   x = 1, 2, . . . , n,

    which decreases as x increases.

    If x < p(n + 1) then r(x) > 1 and the pmf is increasing. If x > p(n + 1) then r(x) < 1 and the pmf is decreasing.

    So the Binomial distribution only has a single peak. Exercise: Find values of n and p so that two successive binomial probabilities are the same.

    Slide 151

  • Sampling with replacement. Suppose that a population consists of N objects, a proportion p of which are defective. A sample of n is obtained by selecting one object at random from the population, replacing it, selecting at random again, and so on.

    We therefore have a sequence of independent Bernoulli trials with P(success) = p.

    Thus if X = number of defectives obtained in the sample, then X d= Bi(n, p).

    Slide 152

  • Geometric random variables (Ghahramani 5.3). Motivating Example: Consider a driver who drives through traffic lights until stopped by a red light. Let Ai be the event that the ith light is red and let

    P(Ai) = p.

    Assume that the Ai are independent events.

    The sample space of the experiment can be taken to be the set of sequences of the form

    ω = G G G G G . . . G R   (n G's followed by an R).

    Slide 153

  • Note that there are infinitely many outcomes in the sample space.

    Now let the random variable N denote the number of green lights that the driver passes through before reaching the red light that finally stops him or her.

    Thus, for ω defined as above,

    N(ω) = N(G G G . . . G R) = n.

    Exercise: What is the event that the driver arrives at the third light and sees that it is green?

    Answer: A ≡ {ω : N(ω) ≥ 3}.

    Slide 154

  • Hence N is a random variable and

    P(N = 0) = P({ω : N(ω) = 0}) = P(R) = p
    P(N = 1) = P({ω : N(ω) = 1}) = P(GR) = (1 − p)p
    P(N = 2) = P({ω : N(ω) = 2}) = P(GGR) = (1 − p)(1 − p)p

    ...

    P(N = n) = (1 − p)(1 − p) . . . (1 − p)p. So the pmf for N is

    pN(n) = P(N = n) = (1 − p)^n p, n = 0, 1, 2, . . .. N takes a countable number of values and is therefore a discrete random variable.

    Slide 155

  • As a check, look at

    Σ_{n=0}^{∞} P(N = n) = Σ_{n=0}^{∞} (1 − p)^n p = p/(1 − (1 − p)) = 1.

    Note: Remember that if |x| < 1 then

    Σ_{n=0}^{∞} x^n = 1/(1 − x).

    This is the formula for the sum of the geometric series and is a basic mathematical fact worth remembering.

    Slide 156

  • Travelling through the traffic lights amounts to conducting a sequence of independent Bernoulli trials.

    If we define "success" to mean "the light is red" then p = P(success).

    In this context we can now think of N as the number of failures before the first success.

    Slide 157

  • Geometric distribution. If N = no. of failures before the first success in a sequence of independent Bernoulli trials with p = P(success), then

    pN(n) = P(N = n) = (1 − p)^n p, n = 0, 1, 2, . . ., and we say N has a Geometric distribution with parameter p and write N d= G(p).

    The Geometric distribution is often thought of as a waiting time before the occurrence of a success. In this context we imagine the Bernoulli trials occur at regularly spaced intervals of time.

    Slide 158

  • Important Note: The Geometric distribution is defined slightly differently in some texts, including Ghahramani, by counting the number of trials until the first success rather than the number of failures before it.

    If we define M to be the number of trials until the first success then clearly M = N + 1, where N d= G(p) under our definition. So instead of taking the values 0, 1, 2, ..., the random variable M takes the values 1, 2, 3, .... The distribution of M is simply shifted in location by one unit to the right.

    Slide 159

  • Example
    You are playing Who wants to be a millionaire. Suppose on the first bank of questions you have probability 0.95 of answering correctly. What chance do you have of winning at least $1000 (i.e. answering at least 5 questions correctly)?
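
    One way to set this question up (our reading, not spelled out on the slide): treat a wrong answer as the "success" that ends the run, so the number of correct answers before it is N d= G(p) with p = 0.05, and winning at least $1000 is the event N ≥ 5, which has probability (1-p)^5. A one-line check in Python:

      p = 0.05                      # probability a given answer is wrong (assumed)
      print((1 - p) ** 5)           # P(N >= 5) = geometric tail = 0.95^5, about 0.774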

    Slide 160

  • Geometric mean and variance
    Applying our formulae for expectations we can deduce that

    E(X) = (1-p)/p,

    V(X) = (1-p)/p^2.
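
    These formulae can also be checked numerically. A sketch using scipy.stats.geom: note that scipy's geometric distribution counts the number of trials up to and including the first success, so a shift of loc = -1 is needed to match the failure-counting convention G(p) used in these notes.

      from scipy.stats import geom

      p = 0.3
      N = geom(p, loc=-1)                  # failures before the first success

      print(N.mean(), (1 - p) / p)         # both equal (1-p)/p
      print(N.var(), (1 - p) / p ** 2)     # both equal (1-p)/p^2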

    Slide 161

  • Lack of memory property
    A curious property of the geometric distribution is the so-called lack of memory property. If T d= G(p) then, for t = 0, 1, 2, ...,

    P(T ≥ t) = p(1-p)^t + p(1-p)^{t+1} + ... = p(1-p)^t / p = (1-p)^t.

    So for given a, t = 0, 1, 2, ..., we have:

    P(T - a ≥ t | T ≥ a) = P(T ≥ t + a | T ≥ a)
                         = P(T ≥ t + a) / P(T ≥ a)
                         = (1-p)^{t+a} / (1-p)^a = (1-p)^t.

    Slide 162

  • Therefore (for non-negative integer a and t) we have:

    P(T - a ≥ t | T ≥ a) = P(T ≥ t).

    Hence, given that the first a trials were all failures, the residual time T - a till the first success will have the same G(p) distribution as the original T.

    The information that there have been no successes in the past a trials has no effect on the future waiting time to a success: the process "forgets"; the past has no effect on the future.

    Slide 163

  • Negative Binomial random variables (Ghahramani 5.3)
    We start with a natural extension of our motivating example for the Geometric distribution. Our driver reaches his first red traffic light. He then continues on his journey, travelling through green lights until he reaches a second red light. Let's say the journey continues like this until he reaches the rth red light.

    Then Z, the total number of green lights before the rth red light, is equal to N_1 + N_2 + ... + N_r, where N_i is the number of green lights between the (i-1)st red light and the ith red light and N_i d= G(p), i = 1, 2, ..., r.

    So Z is the sum of r independent Geometric random variables. We want to find the distribution of Z.

    Slide 164

  • We first observe that one way in which the event Z = z can occur is

    F F ... F S S ... S S,

    with z Fs followed by r-1 Ss and then the final S. The probability of this sequence is (1-p)^z p^r. If the first z + r - 1 results are arranged amongst themselves in any order, leaving the final rth S in place, the event Z = z still occurs. This can be done in

    C(z+r-1, r-1)

    ways (where C(n, k) denotes the binomial coefficient "n choose k"), and for each arrangement the probability is (1-p)^z p^r.

    Slide 165

  • The pmf of Z is therefore given by

    p_Z(z) = P(Z = z) = C(z+r-1, r-1) p^r (1-p)^z,   z = 0, 1, 2, ...

    This formula requires r to be an integer, but it turns out that we can remove this restriction. Recall that when both x and k are nonnegative integers we define

    C(x, k) = x! / ((x-k)! k!).

    We now extend this definition to all real x and nonnegative integer k as follows:

    C(x, k) = x(x-1)...(x-k+1) / k!.

    Slide 166

  • Note: We define C(x, 0) = 1 for all real x.

    If r happens to be an integer we can show that

    C(-r, z) = (-1)^z C(z+r-1, r-1).

    This suggests using

    p_Z(z) = C(-r, z) p^r (p-1)^z,   z = 0, 1, 2, ...,

    as an alternative way of writing the pmf for integer r. In fact we can show that this is a well defined pmf for all real r > 0. We will need to use an extended version of the Binomial Theorem, corresponding to our extended definition of C(x, k).

    Slide 167

  • Extended Binomial Theorem
    For any real r (as opposed to just integer r) we have

    (1 + b)^r = Σ_{k=0}^∞ C(r, k) b^k,

    which converges provided |b| < 1.

    Provided p ≠ 0, it follows that

    p^r Σ_{z=0}^∞ C(-r, z) (p-1)^z = p^r (1 + (p-1))^{-r} = p^r p^{-r} = 1.

    However, to ensure that all the individual terms are non-negative we need to have r > 0.

    Slide 168

  • Negative Binomial Distribution
    If the random variable Z has pmf

    p_Z(z) = C(-r, z) p^r (p-1)^z,   z = 0, 1, 2, ...,

    where r > 0 and 0 < p ≤ 1, then we say Z has a Negative Binomial distribution with parameters r and p and write Z d= Nb(r, p).

    In the special case where r is an integer, Z can be interpreted as the number of failures before the rth success in a sequence of independent Bernoulli trials with p = P(success).
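
    For computation, scipy.stats.nbinom happens to use the same "failures before the rth success" convention as these notes (not Ghahramani's shifted version). A small sketch comparing it with the pmf formula above, with r and p chosen arbitrarily:

      from math import comb
      from scipy.stats import nbinom

      r, p = 3, 0.4
      Z = nbinom(r, p)                     # counts failures before the r-th success

      for z in range(6):
          formula = comb(z + r - 1, r - 1) * p ** r * (1 - p) ** z
          print(z, formula, Z.pmf(z))      # the two columns agree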

    Slide 169

  • Important Note: Ghahramani defines the Negative binomial distribution as the sum of r independent Geometric random variables using its definition of the Geometric (which differs from ours). So in Ghahramani the Negative binomial takes the values r, r+1, r+2, ... and is simply shifted in location r units to the right.

    Slide 170

  • Negative binomial mean and variance
    We can deduce that

    E(X) = r(1-p)/p,

    V(X) = r(1-p)/p^2.

    These results are most easily proved using techniques which we will examine later in the course.

    Slide 171

  • Example
    To complete his degree a part-time student needs to pass three more subjects. Assuming he can take only one subject per semester, and that he passes a subject with probability 0.85 independently of his past results, find the probability that he will need more than 2 but not more than 3 years to graduate.
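
    A sketch of one possible reading of this example (the calendar arithmetic is our assumption, not stated on the slide): with two semesters per year, the student graduates after 3 + Z semesters, where Z d= Nb(3, 0.85) counts failed attempts, so "more than 2 but not more than 3 years" corresponds to 5 or 6 semesters, i.e. Z ∈ {2, 3}.

      from scipy.stats import nbinom

      r, p = 3, 0.85
      Z = nbinom(r, p)                 # failed subjects before the 3rd pass

      # More than 4 but at most 6 semesters  <=>  Z = 2 or Z = 3  (our reading)
      print(Z.pmf(2) + Z.pmf(3))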

    Slide 172

  • Negative Binomial shape
    We can show that the ratio of successive negative binomial probabilities r(z) satisfies

    r(z) = p_Z(z) / p_Z(z-1) = ((r-1)/z + 1)(1-p),   z = 1, 2, ...,

    which decreases as z increases.

    If z < (1-p)(r-1)/p then r(z) > 1 and the pmf is increasing.
    If z > (1-p)(r-1)/p then r(z) < 1 and the pmf is decreasing.
    So the Negative Binomial distribution has only a single peak.

    Slide 173

  • There is a relationship between the distribution functions of a Negative Binomial random variable Z d= Nb(r, p) and a Binomially distributed random variable X d= Bi(n, p). Indeed, {Z ≤ n-r} is the same event as {at most n-r failures before the rth success}, which is the same as {at most n trials to get r successes}, which is the same as {number of successes in the first n trials is greater than or equal to r}, i.e. {X ≥ r}.
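
    This identity is easy to check numerically; a minimal sketch with arbitrary r, p and n:

      from scipy.stats import binom, nbinom

      r, p, n = 3, 0.4, 10
      lhs = nbinom(r, p).cdf(n - r)        # P(Z <= n - r)
      rhs = 1 - binom(n, p).cdf(r - 1)     # P(X >= r)
      print(lhs, rhs)                      # the two values agree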

    Slide 174

  • Hypergeometric random variables (Ghahramani 5.3)
    When we looked at binomial random variables, we considered the experiment of sampling with replacement. In many, if not most, experiments of this type it is more natural to sample without replacement. If we perform such an experiment, then the number of successes is no longer binomially distributed.

    We need a different distribution to describe the number of successes - the hypergeometric distribution.

    Slide 175

  • Again suppose that the population consists of N objects, a proportion p of which are defective. The number of defective items in the population is therefore D = Np. A sample of n is obtained by selecting n objects at random, either all at once or sequentially without replacement. The two procedures are equivalent.

    Slide 176

  • Again we let X be the number of defectives obtained in the sample. In this case the pmf of X is given by

    p_X(x) = C(D, x) C(N-D, n-x) / C(N, n),   x = 0, 1, 2, ..., n.

    Slide 177

  • If X = x, then the sample contains x defectives and n-x non-defectives. The defectives can be chosen in C(D, x) ways, and for each way of choosing the defectives, the non-defectives can be chosen in C(N-D, n-x) ways. So the number of ways of choosing a sample which contains x defectives is C(D, x) C(N-D, n-x).

    There are C(N, n) ways of choosing a sample of n from a population of N, and each is equally likely since the selection is made at random.

    The expression for the pmf follows.

    Slide 178

  • Note that p_X(x) > 0 only if 0 ≤ x ≤ D and 0 ≤ n-x ≤ N-D, since otherwise one or other of C(D, x) or C(N-D, n-x) is zero.

    Therefore p_X(x) > 0 only if A ≤ x ≤ B, where A = max(0, n+D-N) and B = min(n, D). Nevertheless, we usually denote the set of possible values S_X as {0 ≤ x ≤ n}, allowing that some of these values may actually have zero probability.

    Clearly, p_X(x) ≥ 0. It can be shown that Σ_x p_X(x) = 1, by equating coefficients of s^n on both sides of the identity (1+s)^D (1+s)^{N-D} = (1+s)^N.

    Slide 179

  • Hypergeometric Distribution
    If X has pmf

    p_X(x) = C(D, x) C(N-D, n-x) / C(N, n),   x = A, ..., B,

    where A and B are defined above, then we say that X has a hypergeometric distribution with parameters n, D and N and we write X d= Hg(n, D, N).

    Slide 180

  • Hypergeometric mean and variance
    It can be shown that

    E(X) = nD/N,

    V(X) = (nD(N-D)/N^2) (1 - (n-1)/(N-1)).

    It is interesting to compare these formulae to those for the Binomial distribution with p replaced by the proportion of defectives D/N.

    Slide 181

  • Example
    If a hand of five cards is dealt from a well-shuffled pack of fifty-two cards, then the number of spades in the hand X d= Hg(n = 5, D = 13, N = 52).
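
    A computational aside (not on the slide): scipy.stats.hypergeom lists its parameters as (M, n, N) = (population size, number of "defectives" in the population, sample size), which is easy to confuse with the (n, D, N) order used here. A sketch for the card example:

      from scipy.stats import hypergeom

      # Number of spades in a 5-card hand: Hg(n = 5, D = 13, N = 52) in our notation.
      X = hypergeom(M=52, n=13, N=5)

      for x in range(6):
          print(x, X.pmf(x))

      print(X.mean())        # nD/N = 5 * 13/52 = 1.25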

    Slide 182

  • Poisson Random Variables (Ghahramani 5.2)
    Recall that a Binomial random variable counts the total number of successes in a sequence of n independent Bernoulli trials. If we think of a success as an "event" then a Binomial random variable effectively counts events occurring in discrete time.

    A Poisson random variable is an analogue of the Binomial random variable which effectively counts events occurring in continuous time. However both types count events, so both are discrete random variables.

    We derive the Poisson distribution via a limiting process involving sequences of Bernoulli trials as follows.

    Slide 183

  • Assume that each Bernoulli trial takes a time 1/n to complete and that the probability of success in a Bernoulli trial is proportional to this time, say P(success) = λ/n. Then, by time 1, we can complete n trials. Let N = number of events which occur by time 1. Then N d= Bi(n, λ/n) and hence

    P(N = k) = C(n, k) (λ/n)^k (1 - λ/n)^{n-k}.

    Now we shrink the length of time for each trial and the success probability at the same rate, by letting n → ∞.

    Slide 184

  • It is a basic mathematical fact that

    lim_{n→∞} (1 + x/n)^n = e^x,

    and so

    lim_{n→∞} (1 - λ/n)^n = e^{-λ}.

    Slide 185

  • So we have

    lim_{n→∞} P(N = k) = lim_{n→∞} C(n, k) (λ/n)^k (1 - λ/n)^{n-k}

                       = lim_{n→∞} [n! / (n^k (n-k)!)] [λ^k / k!] (1 - λ/n)^n (1 - λ/n)^{-k}

                       = 1 · (λ^k / k!) · e^{-λ} · 1

                       = e^{-λ} λ^k / k!,   k = 0, 1, 2, ...

    Slide 186

  • If N has pmf p_N(k) = e^{-λ} λ^k / k!   (k = 0, 1, 2, ...), we say that N has a Poisson distribution with parameter λ, and we write N d= Pn(λ).

    So in this case N d= Pn(λ).

    Note that p_N(k) ≥ 0, and that Σ_{k=0}^∞ p_N(k) = 1, since

    Σ_{k=0}^∞ λ^k / k! = e^λ.

    This is the Taylor series expansion of e^λ and is another basic mathematical fact that is worth remembering.

    Slide 187

  • Poisson mean and variance
    Applying our formulae for expectations we can deduce that

    E(X) = λ,
    E(X(X-1)) = λ^2,
    V(X) = E(X(X-1)) + E(X) - E(X)^2 = λ.

    Note the interesting fact that the mean and variance of a Poisson random variable are equal.

    Slide 188

  • Example
    Assume cars pass an isolated petrol station on a country road at a constant mean rate of 5 per hour, or equivalently 2.5 per half hour. Let N denote the number of cars which pass the petrol station whilst it is temporarily closed for half an hour one Friday afternoon. What is the probability that the station missed out on three or more potential customers?

    Slide 189

  • The average number of cars (events) in half an hour is 2.5, so N d= Pn(2.5).

    P(N ≥ 3) = 1 - P(N ≤ 2)
             = 1 - {e^{-2.5} 2.5^0/0! + e^{-2.5} 2.5^1/1! + e^{-2.5} 2.5^2/2!}
             = 1 - {0.0821 + 0.2052 + 0.2565}
             = 0.4562.
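
    The same arithmetic via scipy.stats.poisson, as a sketch:

      from scipy.stats import poisson

      N = poisson(2.5)            # cars passing during the half-hour closure
      print(1 - N.cdf(2))         # P(N >= 3), approximately 0.4562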

    Slide 190

  • Our original derivation of the Poisson distribution shows that it approximates the Binomial distribution under the right circumstances. As we shrunk the length of time for each trial and the success probability at the same rate, in the Binomial distribution for N d= Bi(n, λ/n) we had

    Number of trials: n → ∞,
    Probability of success: λ/n → 0,
    Average number of events: (λ/n) n = λ.

    Slide 191

  • Poisson approximation to Binomial
    So, if n is large and p is small, the Poisson distribution with λ = np can be used as a convenient approximation to the Binomial distribution:

    Bi(n, p) ≈ Pn(np) for n large and p small.

    A rough rule is that the approximation is satisfactory if n ≥ 20 and p ≤ 0.05.

    Slide 192

  • Example
    One in 10,000 items from a production line is defective. The occurrences of defects in successive items are independent. What is the probability that in a batch of 20,000 items there will be at least 4 defective items?

    Slide 193

  • Let X be the total number of defectives. Then

    X d= Bi(20000, 1/10000).

    As n is large and p is small, X ≈ Pn(2), so

    P(X ≥ 4) = 1 - P(X ≤ 3)
             = 1 - {e^{-2} 2^0/0! + e^{-2} 2^1/1! + e^{-2} 2^2/2! + e^{-2} 2^3/3!}
             = 0.1429.
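
    As a sketch (not part of the slide), the exact binomial tail can be compared with the Poisson approximation used above:

      from scipy.stats import binom, poisson

      n, p = 20_000, 1 / 10_000
      print(1 - binom(n, p).cdf(3))         # exact Bi(20000, 1/10000) answer
      print(1 - poisson(n * p).cdf(3))      # Poisson(2) approximation, about 0.1429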

    Slide 194

  • Example
    100,000,000 games of Super 66 Lotto are played. The probability of winning the Division 1 prize is 1/45,360,620. How many Division 1 winners can be expected?

    Slide 195

  • Discrete uniform random variables
    Consider the discrete random variable X having pmf

    p_X(x) = 1/(n - m + 1),   x = m, m+1, ..., n,

    where m and n are integers such that m ≤ n. We say that X has a discrete uniform distribution on m ≤ x ≤ n, and we write X d= U(m, n).

    If X denotes the result of throwing a fair die, then X d= U(1, 6).

    If Y denotes the result of spinning a roulette wheel (with one zero), then Y d= U(0, 36).

    Slide 196

  • Discrete uniform mean and variance
    Applying our formulae for expectations we can deduce that for X d= U(0, n)

    E(X) = n/2,

    E(X^2) = n(2n+1)/6,

    V(X) = n(n+2)/12.

    Slide 197

  • Example
    Consider a sequence of independent Bernoulli trials with probability of success p. We are given the additional information that in the first n trials there is exactly one success. Let X denote the number of the (random) trial at which this single success occurred. What is the pmf of X?

    Slide 198

  • Example
    Clearly the possible values of X are S_X = {1, 2, ..., n}.

    p_X(x) = P(xth trial a success, other (n-1) trials failures) / P(one success in n trials)

           = p(1-p)^{n-1} / (C(n, 1) p(1-p)^{n-1})

           = 1/n,   x = 1, 2, ..., n.

    Therefore X d= U(1, n).

    Slide 199

  • Continuous uniform random variables (Ghahramani 7.1)
    We have seen that a discrete uniform random variable X has a pmf that is constant over the possible values in the set S_X, which must be finite. In contrast a continuous uniform random variable X has a pdf that is constant over the possible values in the set S_X, which must be bounded.

    Slide 200

  • Let a < b be real numbers. Then a continuous random variable X having pdf given by

    f_X(x) = 1/(b - a),   a < x < b,

    has a continuous uniform or rectangular distribution on the interval (a, b) and we write X d= R(a, b).

    Slide 201

  • Examples
    1. If a set of real numbers are rounded down to the integer below, then for a number randomly selected from the set, it is often reasonable to assume that the error is distributed as R(0, 1).

    2. If a needle is thrown randomly onto a horizontal surface, the acute angle Θ that it makes with a specified direction is distributed as R(0, π/2).

    Slide 202

  • Mean and variance of a Continuous Uniform Random Variable
    Exercise: Show that if X d= R(a, b) then

    E(X) = (a + b)/2,

    V(X) = (b - a)^2 / 12.

    Slide 203

  • Exponential Random Variables (Ghahramani 7.3)
    Recall that a geometric random variable modelled the number of failures before the first success in a sequence of Bernoulli trials.

    Exponential random variables can be thought of as a continuous version of geometric random variables. They model the waiting time until a randomly-generated event occurs in continuous time. We derive them via a limiting process involving sequences of Bernoulli trials as follows.

    Slide 204

  • Assume that each Bernoulli trial takes a time 1/n to complete and that the probability of success in a Bernoulli trial is proportional to this time, say P(success) = λ/n. Then, in time t, we can complete nt trials and the probability that there is no success in a time period t is

    (1 - λ/n)^{nt}.

    Now we shrink the length of time for each trial and the success probability at the same rate, by letting n → ∞.

    Slide 205

  • Thus the probability that there is no event in time t is e^{-λt}.

    Let T be the waiting time to the first event. Then this is equivalent to saying that

    P(T > t) = e^{-λt}.

    Slide 206

  • The distribution function of T is therefore given by

    F_T(t) = P(T ≤ t) = 1 - P(T > t) = 1 - e^{-λt},   for t > 0,

    and as T can't be negative, F_T(t) = 0 for t < 0.

    A continuous random variable T with this distribution is known as an exponential random variable with parameter λ and we write T d= exp(λ).

    Slide 207

  • Differentiating tells us that the pdf of T is

    f_T(t) = λ e^{-λt} for t > 0,   and f_T(t) = 0 for t < 0.

    Note that f_T(t) ≥ 0 and that ∫_{-∞}^{∞} f_T(t) dt = 1.

    Slide 208

  • Example
    The waiting times between successive cars on a country road are exponentially distributed with parameter λ = 10. Find the probability that a gap between successive cars exceeds 0.2 time units.

    Solution
    T = time gap between cars d= exp(λ = 10).

    Therefore P(T > 0.2) = e^{-10 × 0.2} = e^{-2} = 0.1353.

    Slide 209

  • Exponential Mean and Variance
    Using the definition of expectation, for X d= exp(λ) we can write

    E(X) = ∫_0^∞ x f_X(x) dx = ∫_0^∞ x λ e^{-λx} dx.

    Exercise: Use integration by parts to find E(X).

    However there is another approach which uses a generally useful alternative formula for calculating E(X).

    Slide 210

  • We can derive a formula for the expected value in terms of the distribution function, rather than the pmf or pdf. This is most easily seen in the case of a continuous random variable. Extend the definition of the pdf so that it is defined on the interval (-∞, ∞), with f_X(x) = 0 at some points if necessary. Then

    E[X] = ∫_{-∞}^{∞} x f_X(x) dx = ∫_{-∞}^{0} x f_X(x) dx + ∫_0^∞ x f_X(x) dx.

    Slide 211

  • Now use a trick. We know that

    d/dx F_X(x) = f_X(x).

    However, it is also true that

    d/dx (-(1 - F_X(x))) = f_X(x).

    Slide 212

  • We perform integration by parts using F_X(x) as an anti-derivative of f_X(x) in the first integral and -(1 - F_X(x)) as an anti-derivative of f_X(x) in the second integral. Thus

    E[X] = ∫_{-∞}^{0} x f_X(x) dx + ∫_0^∞ x f_X(x) dx

         = [x F_X(x)]_{-∞}^{0} - ∫_{-∞}^{0} F_X(x) dx + [-x(1 - F_X(x))]_0^∞ + ∫_0^∞ (1 - F_X(x)) dx

         = -∫_{-∞}^{0} F_X(x) dx + ∫_0^∞ (1 - F_X(x)) dx.

    The facts that F_X(-∞) = 0 and F_X(∞) = 1 ensure that the first and third terms in the middle equation are zero.

    Slide 213

  • In the special case that the random variable is non-negative, the above formula reduces to

    E[X] = ∫_0^∞ (1 - F_X(x)) dx,

    which turns out to be very useful.

    Example

    If X d= exp(λ), then, for x > 0, F_X(x) = 1 - e^{-λx}. Hence

    E[X] = ∫_0^∞ (1 - F_X(x)) dx = ∫_0^∞ e^{-λx} dx = 1/λ.
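
    A numerical sketch of this survival-function formula for the exponential case (the choice λ = 2 is arbitrary), using scipy.integrate.quad:

      import numpy as np
      from scipy.integrate import quad

      lam = 2.0
      survival = lambda x: np.exp(-lam * x)        # 1 - F_X(x) for X d= exp(lam)

      value, _ = quad(survival, 0, np.inf)
      print(value, 1 / lam)                        # both equal 0.5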

    Slide 214

  • To calculate the variance for X d= exp(λ) we first use repeated integration by parts to show

    E(X^2) = ∫_0^∞ x^2 λ e^{-λx} dx = 2/λ^2.

    Hence we have

    V(X) = E(X^2) - E(X)^2 = 2/λ^2 - 1/λ^2 = 1/λ^2.

    We note the interesting fact that the mean and standard deviation of the exponential distribution are equal.

    Slide 215

  • Lack of memory property
    Like the geometric distribution, the exponential distribution has the lack of memory property. If X d= exp(λ) then, for x ∈ [0, ∞),

    P(X ≥ x) = e^{-λx}.

    So for given x, y ∈ [0, ∞), we have:

    P(X - y ≥ x | X ≥ y) = P(X ≥ x + y | X ≥ y)
                         = P(X ≥ x + y) / P(X ≥ y)
                         = e^{-λ(x+y)} / e^{-λy} = e^{-λx}
                         = P(X ≥ x).

    Slide 216

  • Gamma random variables (Ghahramani 7.4)
    An exponential random variable can be thought of as a continuous analogue of a geometric random variable. Both model the waiting time until the first occurrence of an event, one in continuous time and one in discrete time.

    Similarly, a gamma random variable can be thought of as a continuous analogue of a negative binomial random variable. Both model the waiting time until the rth occurrence of an event, one in continuous time and one in discrete time.

    Slide 217

  • Consider the same sequence of Bernoulli trials that we used to define the Poisson and exponential random variables. That is, each trial takes a time 1/n to complete and the probability of success is P(success) = λ/n.

    Now, the probability that we have to wait for more than nz trials for the rth success is the same as the probability that we have less than r successes in the first nz trials.

    Slide 218

  • The number of successes that we have in nz trials is binomially distributed, so the probability that we have less than r successes in the first nz trials is

    Σ_{k=0}^{r-1} C(nz, k) (λ/n)^k (1 - λ/n)^{nz-k}.

    As n → ∞, this approaches

    Σ_{k=0}^{r-1} ((λz)^k / k!) e^{-λz}.

    Slide 219

  • So, if we let Z denote the waiting time until the rth event, then

    P(Z > z) = Σ_{k=0}^{r-1} ((λz)^k / k!) e^{-λz}.

    Therefore the distribution function of Z is

    F_Z(z) = P(Z ≤ z) = 1 - Σ_{k=0}^{r-1} ((λz)^k / k!) e^{-λz}.

    Slide 220

  • To get the pdf of Z, we differentiate this with respect to z. Thus

    f_Z(z) = d/dz [1 - Σ_{k=0}^{r-1} ((λz)^k / k!) e^{-λz}]

           = λ e^{-λz} + Σ_{k=1}^{r-1} [-(λ^k z^{k-1} / (k-1)!) e^{-λz} + (λ^{k+1} z^k / k!) e^{-λz}]

           = (λ^r z^{r-1} / (r-1)!) e^{-λz}.

    Slide 221

  • A continuous random variable Z with pdf given by

    f_Z(z) = (λ^r z^{r-1} / (r-1)!) e^{-λz}

    is known as a gamma or Erlang random variable with parameters r and λ. We write Z d= γ(r, λ).

    As we saw above, for z > 0, the distribution function of a Gamma random variable is

    F_Z(z) = 1 - Σ_{k=0}^{r-1} ((λz)^k / k!) e^{-λz}.

    Note that exp(λ) = γ(1, λ).

    Slide 222

  • The scope of the gamma distribution can be extended to all real r > 0 by using the continuous analogue of the factorial, which is the gamma function

    Γ(r) = ∫_0^∞ e^{-x} x^{r-1} dx.

    Using integration by parts, it is readily shown that Γ(r) = (r-1)Γ(r-1) for all r > 1. Then, since Γ(1) = ∫_0^∞ e^{-x} dx = 1, it follows that if k is a positive integer then Γ(k) = (k-1)!.

    Slide 223

  • For z > 0, the function

    f(z) = λ^r e^{-λz} z^{r-1} / Γ(r)

    is a pdf.

    This follows because f(z) ≥ 0 and also ∫ f(z) dz = 1. This last statement follows from

    ∫_0^∞ e^{-λz} λ^r z^{r-1} dz = ∫_0^∞ e^{-u} u^{r-1} du = Γ(r),

    using the substitution u = λz.

    Slide 224

  • Example
    A machine has a component which fails once every 100 hours on average. The failures are randomly distributed in time. There are only three replacement components available. The time (in hours) for which the machine will remain operative, Z, is the waiting time until the fourth event, when events occur at rate λ = 0.01 (failures/hour). Therefore Z d= γ(r = 4, λ = 0.01).
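
    A sketch of what one might compute for this example (the 500-hour horizon is our own illustrative choice): scipy.stats.gamma is parameterised by a shape a = r and a scale 1/λ, a convention that is easy to trip over.

      from scipy.stats import gamma

      r, lam = 4, 0.01
      Z = gamma(a=r, scale=1 / lam)    # gamma(r = 4, lambda = 0.01)

      print(Z.mean())                  # r / lambda = 400 hours of operation on average
      print(Z.sf(500))                 # P(Z > 500): machine still running after 500 hours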

    Slide 225

  • Gamma mean and variance
    We start by deriving a formula for the kth moment of X d= γ(r, λ) as follows:

    E(X^k) = ∫_0^∞ (λ^r e^{-λx} x^{r+k-1} / Γ(r)) dx

           = (1 / (λ^k Γ(r))) ∫_0^∞ e^{-u} u^{r+k-1} du     [u = λx]

           = Γ(r+k) / (Γ(r) λ^k).

    Slide 226

  • Note that:

    1. E[X^k] ≠ E[X]^k = r^k/λ^k, except if k = 1 or k = 0;
    2. if k is a positive integer, then E[X^k] = r(r+1)...(r+k-1)/λ^k;
    3. this result applies for all values of k for which the integral converges: for example E[X^{-1}] = λ/(r-1), provided r > 1.

    Hence for X d= γ(r, λ):

    E(X) = r/λ,

    V(X) = E(X^2) - E(X)^2 = (r+1)r/λ^2 - r^2/λ^2 = r/λ^2.

    Slide 227

  • Normal Random Variables (Ghahramani 7.2)
    We turn now to one of the most important distributions in probability and statistics - the Normal distribution. It turns out, as we shall see later (Chapter 7), that the normal distribution occurs frequently as a limit. In particular the Central Limit Theorem says that the sum of a large number of independent and identically distributed random variables is approximately normally distributed, irrespective of the specific underlying distribution, provided that it has finite mean and variance.

    This is the primary reason for the importance of the normal distribution, which is endemic in nature.

    Slide 228

  • If X has pdf

    f_X(x) = (1 / (σ√(2π))) e^{-(x-μ)^2 / (2σ^2)},   -∞ < x < ∞,

    we say that X has a normal distribution with parameters μ and σ^2, and we write X d= N(μ, σ^2).

  • Standardisation
    If Z = (X - μ)/σ, with σ > 0, then the distribution function of Z is given by

    F_Z(z) = P(Z ≤ z) = P((X - μ)/σ ≤ z) = P(X ≤ μ + σz) = F_X(μ + σz).

    Differentiating with respect to z gives

    f_Z(z) = σ f_X(μ + σz) = φ(z).

    Hence, we have shown that:

    If X d= N(μ, σ^2), then Z = (X - μ)/σ d= N(0, 1).

    Slide 230

  • This means we can calculate probabilities for any normal distribution using the standard normal distribution.

    The distribution function of Z d= N(0, 1) is denoted by

    Φ(z) = ∫_{-∞}^{z} (1/√(2π)) e^{-t^2/2} dt.

    There is no explicit formula for this integral, so tables of Φ(z) have been compiled using numerical integration.

    Tables 1 and 2 in Ghahramani give values of Φ(z). The standard normal distribution function is also available on many calculators. As φ(z) is an even function, Φ(-z) = 1 - Φ(z).

    Slide 231

  • Example
    If X d= N(10, 25) and Z d= N(0, 1) then

    P(X < 8) = P((X - 10)/5 < -0.4)
             = 1 - P(Z < 0.4)
             = 0.3446.

    Slide 232

  • The converse of our standardisation result is also easy to prove, that is:

    If Z d= N(0, 1), then X = μ + σZ d= N(μ, σ^2).

    This is used in simulations to generate observations on any normal random variable by first generating observations on a standard normal and then transforming them as indicated.

    It also implies that

    E(X) = μ + σ E(Z)    and    V(X) = σ^2 V(Z),

    so we can find the mean and variance of X d= N(μ, σ^2) from the corresponding quantities for Z. As an exercise we will actually find all of the moments of Z.

    Slide 233

  • Moments of standard normal rv
    For Z d= N(0, 1), we have

    E[Z^n] = ∫_{-∞}^{∞} z^n (1/√(2π)) e^{-z^2/2} dz.

    Integrating by parts with u = z^{n-1} and dv = (1/√(2π)) z e^{-z^2/2} dz, we obtain

    E[Z^n] = [-(1/√(2π)) z^{n-1} e^{-z^2/2}]_{-∞}^{∞} + ∫_{-∞}^{∞} (n-1) z^{n-2} (1/√(2π)) e^{-z^2/2} dz.

    Therefore E[Z^n] = (n-1) E[Z^{n-2}].

    Slide 234

  • We saw that E[Z^n] = (n-1) E[Z^{n-2}].

    We also know that E[Z^0] = 1 (as φ(z) is a pdf) and E[Z^1] = 0 (as the pdf is an even function). It follows that

    E[Z^{2k+1}] = 0

    and

    E[Z^{2k}] = (2k-1)(2k-3)...1 = (2k)! / (2^k k!).

    For k = 1 we have E(Z^2) = 1 and hence V(Z) = 1 - 0 = 1.

    Slide 235

  • Consequently for X d= N(μ, σ^2)

    E(X) = μ + σ · 0 = μ   and   V(X) = σ^2 · 1 = σ^2.

    So the parameters for the normal distribution are in fact equal to its mean μ and variance σ^2 (justifying our choice of notation). We say that the normal distribution is parameterised by its mean and variance.

    Slide 236

  • Normal approximations
    The Central Limit Theorem tells us that if X_1, X_2, ... are independent identically distributed random variables with finite mean and variance then the sum

    S = X_1 + X_2 + ... + X_n

    is approximately Normal as n → ∞.

    Slide 237

  • We already know that for N d= Nb(r, p)

    N = X_1 + X_2 + ... + X_r

    where each X_i d= G(p) and they are independent. So

    Nb(r, p) → Normal as r → ∞.

    Slide 238

  • Consider the Binomial random variable M d= Bi(n, p). It counts the number of successes in n independent Bernoulli trials with success probability p, so

    M = X_1 + X_2 + ... + X_n

    where each X_i is a Bernoulli random variable (Bi(1, p)) and they are independent. So

    Bi(n, p) → Normal as n → ∞.

    Slide 239

  • Consider the Poisson random variable N d= Pn(nλ). It counts the number of events in the interval (0, n), where the events occur in continuous time at a uniform rate of λ per unit time. Divide (0, n) into n subintervals of unit length. Then

    N = X_1 + X_2 + ... + X_n

    where each X_i d= Pn(λ) counts the number of events in the ith subinterval and they are independent. So, putting μ = nλ,

    Pn(μ) → Normal as μ → ∞.

    Slide 240

  • Transformations of Random Variables (Ghahramani 6.2)