COMP4121 Lecture Notes - Week 0

A (very brief) probability and statistics refresher

LiC: Aleks Ignjatovic

[email protected]

THE UNIVERSITY OF NEW SOUTH WALES

School of Computer Science and Engineering
The University of New South Wales

Sydney 2052, Australia


CAVEAT:

This is a VERY BRIEF REFRESHER ONLY of basic Probability and Statistics, which we will be expanding as we progress through the course by introducing more notions as we need them. It is definitely NOT meant to be a replacement for a proper course in Probability and Statistics; if you have not taken such a course you should seriously consider taking one, or, at the very least, you should read a good book on Statistics, because statistical methods are essential for the design and testing of modern algorithms.¹

1 Basic Definitions

• The sample space, denoted Ω, is the set of all possible outcomes of a process (usually called an experiment) whose outcomes are non-deterministic, i.e., which might be different if the experiment were repeated.

Examples: Consider the following two experiments:

1. we toss a coin 3 times in a row;

2. we measure the amplitude of a signal in the presence of noise (which is always present!)

In the first example the sample space is the set

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

Note that the outcomes of each individual tossing of the coin are NOT in the sample space; only the triplets belong to it. In the second example the sample space is the set of real numbers R (or a subset of R which corresponds to the range of your measuring instrument).

• An event is any subset of the sample space. Events which we are interested in are usually definable by some property which each individual sample might or might not have. For example, in Example 1 above, one such property could be getting at least two heads, and the samples which satisfy this property are HHH, HHT, HTH, THH; thus, this event is the set {HHH, HHT, HTH, THH}. In the second example an event might be “getting a measurement in the range [0, 1]”; thus the corresponding set of samples is the interval [0, 1] ⊂ R. An event “has happened” (has been realised) in an experiment if the sample which is the outcome of the experiment belongs to this event.

¹One of the best books on statistics for computer scientists, which we closely follow here, is Larry Wasserman’s “All of Statistics”, which dispenses with all academic “niceties” and “pedagogical examples” and cuts straight into the core of the subject matter, covering essentially all of the topics which are important for present-day computer science, in particular for algorithms in machine learning, information retrieval, data-mining, etc.


• Probability of an event is given by a probability distribution function P which assigns numbers to events‡ such that the assigned values are all in the range [0, 1] and, in particular, the probability assigned to the entire set Ω of all outcomes is 1. Moreover, if A = ⋃_{i=1}^∞ A_i and all A_i are pairwise disjoint sets, then

\[ P(A) = \sum_{i=1}^{\infty} P(A_i) \]

must hold.

• Events {A_i}_{1≤i≤n} are independent if

\[ P\Big(\bigcap_{i=1}^{n} A_i\Big) = \prod_{i=1}^{n} P(A_i). \]

• Conditional probability of A given B is defined as

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \]

It is the probability that event A has happened assuming that event B has happened; see Figure 1.1. Thus, B becomes the universe of all possible outcomes (because we assume that B has happened); the probability that A has also happened is then simply equal to whatever fraction A ∩ B is of the entire B (with the probability of a set as the measurement of the “size” of a set).

[Figure 1.1: P(A | B) = P(A ∩ B)/P(B); the events A and B inside the sample space Ω]

Note that the above implies that if A is independent of B then

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A), \]

i.e., the assumption that B has happened has no impact on the probability that A has also happened, which justifies the term.
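To make these definitions concrete, here is a minimal Python sketch (not part of the original notes) which enumerates the sample space of Example 1 and computes a conditional probability directly from the definition; the particular events A and B chosen here are illustrative.

    from itertools import product

    # Sample space of Example 1: all triplets of coin tosses, each equally likely.
    omega = [''.join(t) for t in product('HT', repeat=3)]

    def prob(event):
        # Probability of an event (a subset of the sample space)
        # under the uniform distribution on omega.
        return len(event) / len(omega)

    A = {w for w in omega if w.count('H') >= 2}   # at least two heads
    B = {w for w in omega if w[0] == 'H'}         # first toss is a head

    # P(A|B) = P(A ∩ B) / P(B)
    print(prob(A & B) / prob(B))   # 0.75, versus the unconditional P(A) = 0.5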

‡You can ignore the footnotes marked with ‡, like the present one; they are there just to keep the mathematicians quiet, because they state something irrelevant to engineers, such as the one which now follows: It is generally NOT possible to assign a probability to every single event (i.e., to every subset of the sample space), but such “pathological” sets can be safely ignored by engineers simply because they do not correspond to any events that make sense in practice; their “existence” relies on fancy infinitary set theory (the Axiom of Choice) which, for an engineer, is as relevant as theology.


Let B be an arbitrary event, and assume that the entire universe Ω of all outcomes has been partitioned into disjoint subsets {A_i}_{1≤i≤n}, i.e., assume that the A_i satisfy: for all i, j ≤ n with i ≠ j,

\[ A_i \cap A_j = \emptyset \quad \text{and} \quad \bigcup_{1 \le i \le n} A_i = \Omega; \]

see Figure 1.2. Then,

[Figure 1.2: an event B intersecting the cells A_1, ..., A_n of a partition of Ω]

\[ P(B) = \sum_{i=1}^{n} P(A_i \cap B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i). \]

Thus, we get Bayes’ Theorem:

\[ P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(B \mid A_i)\,P(A_i)}{P(B)} = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}. \tag{1.1} \]

Note that P(A_i) is called the prior probability of A_i (before B has been observed to have happened) and P(A_i | B) is the posterior probability (after event B was observed). Bayes’ Theorem is important for Computer Science because of Bayesian Belief Revision, in which our degree of belief changes according to what events have been observed to have happened.

Example: A computer manufacturing company has two suppliers of hard disk drives. Company C1 supplies 40% and C2 supplies 60% of all disks. Estimates from past experience suggest that 0.4% of C1’s drives and 0.3% of C2’s drives are defective.

1. What is the probability that a randomly selected disk is from company C1?

2. Such a randomly selected disk has subsequently been tested and found to be defective. What is now the probability that it is from C1?


Let us denote by A1 (A2, respectively) the event that the randomly chosen disk is from company C1 (C2, respectively), and by B the event that the disk is defective; thus,

P(A1) = 40/100; P(A2) = 60/100; P(B |A1) = 4/1000; P(B |A2) = 3/1000.

Clearly, the answer to question 1 is P(A1) = 40/100, which is the prior probability that the disk is from company C1. After the new piece of information has arrived, knowing that the selected disk is defective, we can revise the degree of our belief that the disk is from company C1 by using Bayes’ Theorem: since

\[ P(B) = P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) = \frac{4}{1000} \times \frac{40}{100} + \frac{3}{1000} \times \frac{60}{100} = 0.0034, \]

we get

\[ P(A_1 \mid B) = \frac{P(B \mid A_1)\,P(A_1)}{P(B)} = \frac{4/1000 \times 40/100}{0.0034} \approx 0.47. \]

Thus, after the new piece of information has arrived, our belief that the disk comes from C1 has increased from 0.4 to 0.47, which is to be expected, because disks from company C1 are defective more frequently than the disks from company C2.
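The same belief revision can be reproduced numerically; the following short Python sketch (illustrative only, with the numbers taken from the example above) does so:

    # Priors and likelihoods taken from the example above.
    p_A1, p_A2 = 0.40, 0.60                       # P(from C1), P(from C2)
    p_B_given_A1, p_B_given_A2 = 0.004, 0.003     # P(defective | C1), P(defective | C2)

    # Law of total probability: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2).
    p_B = p_B_given_A1 * p_A1 + p_B_given_A2 * p_A2      # 0.0034

    # Bayes' Theorem: posterior probability that a defective disk came from C1.
    p_A1_given_B = p_B_given_A1 * p_A1 / p_B
    print(round(p_A1_given_B, 2))                        # 0.47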

2 Random Variables

• A random variable X is a mapping from the set of all samples into the setof real numbers. Thus, X assigns a value X(ω) to every sample ω.‡

• A discrete random variable X is a random variable which can have at most countably many values x_0, ..., x_n, ...; a random variable which can have continuum many values is called a continuous random variable.

• The cumulative distribution function (CDF) of a random variable X is the mapping from R into the interval [0, 1] given by F_X(x) = P({ω : X(ω) ≤ x}) or, in the more usual notation, F_X(x) = P(X ≤ x).

• The probability mass function f_X(x) of a discrete random variable X specifies the probability that X attains each of its possible values x_i, i.e., f_X(x_i) = P(X = x_i) = p_i; clearly, ∑_i p_i = 1.

• The probability density function (PDF) of a continuous random variable is a function f_X(x) ≥ 0 satisfying

\[ \int_{-\infty}^{a} f_X(x)\,dx = P(X \le a). \tag{2.1} \]

‡Just as with probability distributions, not all such mappings are legitimate random variables, but only those which have the property that the inverse images X^{-1}[a, b] of intervals [a, b] are in the domain of the corresponding probability distribution on the (subsets of the) sample space. Since the inverse image of an interval can be “weird” only for “very weird” random variables X, all mappings from the sample space into R which make physical sense are in fact legitimate random variables. Thus, the engineers can safely let the mathematicians worry about the rest.


This implies that for all a, b such that b > a,

\[ P(a \le X \le b) = P(X \le b) - P(X \le a) = \int_{-\infty}^{b} f_X(x)\,dx - \int_{-\infty}^{a} f_X(x)\,dx = \int_{a}^{b} f_X(x)\,dx. \tag{2.2} \]

Note that, by definition, a random variable is a fully deterministic mapping from the set of all possible outcomes of an experiment into real numbers; we can think of a random variable as a (deterministic) measurement performed on the outcome of an experiment. The values of a random variable are non-deterministic only because its inputs are such.

However, we usually do not care about the particular samples which are the outcomes of an experiment, but only about the values of the “measurements” performed on these random samples, i.e., we only care about the values of the random variables representing the values of these measurements. Thus, we usually ignore the samples altogether and consider a random variable itself as a “probabilistic object” specified by the composition of X^{-1} and the probability distribution P: for every x ∈ R let

\[ A_x = \{\omega : X(\omega) \le x\} = X^{-1}\big((-\infty, x]\big); \]

then x is assigned the value P(A_x) = P(X ≤ x).

[Figure 2.1: A random variable seen as a mapping from R into [0, 1]]

Using the first equality of (2.2) we get the probability of the event that the value of a random variable X belongs to an interval [a, b]; this can be extended to obtain the probability that the value of X belongs to any “reasonable” subset A ⊂ R. Thus, every random variable induces a probability distribution on R, and we can now ignore the set of samples Ω and deal only with random variables, their corresponding probability distributions P(X ∈ A) and their corresponding probability density functions f_X(x); see Figure 2.1.‡‡

Warning: Random variables CANNOT be identified with their probability distributions: for example, if the distribution of X is symmetric around the origin, in the sense that P(X ≤ a) = P(X ≥ −a) for every a ∈ R, then

\[ P(X \le a) = P(X \ge -a) = P(-X \le a). \]

Thus, X and −X have the same probability distribution, but they are clearly not equal as mappings from samples into reals.

• Two random variables X, Y are equally distributed if they have the same probability distribution function, in which case we write X ∼ Y.

If F(x) (f(x), respectively) is a probability distribution (probability density function, respectively) and X has probability distribution F(x) (probability density function f(x), respectively), we abuse the notation and also write X ∼ F (X ∼ f, respectively).

3 Expectation and Variance of a Random Variable

The expected value of a random variable is simply its mean. Thus, if X can take countably many values v_0, v_1, ... with the corresponding probabilities p_0, p_1, ..., then

\[ E(X) = \sum_{i=0}^{\infty} v_i \, p_i, \]

provided that the sum converges. Thus, not every random variable has a finite mean value.

If X is a continuous random variable with probability density function f(x), then

\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \]

again assuming that the integral converges.
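As an illustrative sanity check (not from the notes), the following Python/NumPy sketch compares the sample mean of many independent draws with a direct discretisation of the defining integral, using the exponential density of Section 4.9 with β = 2, whose mean is β:

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 2.0   # parameter of the exponential density f(x) = (1/beta) e^{-x/beta}

    # The sample mean of many independent draws...
    samples = rng.exponential(scale=beta, size=1_000_000)
    print(samples.mean())               # close to 2.0

    # ...versus a numerical version of the defining integral E(X) = ∫ x f(x) dx.
    x = np.linspace(0.0, 100.0, 2_000_001)
    f = np.exp(-x / beta) / beta
    dx = x[1] - x[0]
    print((x * f).sum() * dx)           # also close to 2.0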

‡‡Again, not every continuous random variable has a corresponding probability density function, but only those which induce a “well-behaved” probability distribution P(X ≤ x), the so-called absolutely continuous probability distributions. Roughly speaking, “well-behaved” probability distributions are those which assign probability zero to all sets whose “geometric size” is zero (such as, for example, sets of isolated points). Since all physically meaningful random variables have this property, for engineers all continuous random variables have a probability density function, which the mathematicians call the Radon-Nikodym derivative of the probability distribution P(X ≤ x) - a term justified by taking the derivatives with respect to a of both sides of equation (2.1).


Sometimes expectation is written as E(X) = ∫ x dF_X(x), where F_X(x) is the cumulative distribution function, F_X(x) = P(X ≤ x); this is not just notation but can be given a precise meaning using more sophisticated mathematical machinery.‡

Properties of expectation

• Let Y = g(X); then E(Y) = ∫ g(x) dF_X(x).

• Let X_1, ..., X_n be random variables and c_1, ..., c_n real numbers; then

\[ E(c_1 X_1 + \cdots + c_n X_n) = c_1 E(X_1) + \cdots + c_n E(X_n). \]

• Assume X_1, ..., X_n are independent; then

\[ E(X_1 X_2 \cdots X_n) = E(X_1)\,E(X_2) \cdots E(X_n). \]

The kth moment of a distribution X is defined as E(X^k), assuming that E(|X|^k) exists, i.e., assuming that

\[ E(|X|^k) = \int |x|^k\,dF_X(x) < \infty. \]

Clearly, if E(|X|^k) < ∞, then also E(X^k) ≤ E(|X|^k) < ∞. The stronger assumption, namely E(|X|^k) < ∞ rather than just E(X^k) < ∞, ensures that the moments are “well behaved”. For example, one can easily verify that if E(|X|^k) < ∞, then also E(X^m) ≤ E(|X|^m) < ∞ for all m ≤ k, i.e., the existence of a higher order moment guarantees the existence of all lower order moments.

The kth central moment of a distribution X is defined as the kth moment of Y = X − E(X).

The moment generating function of a random variable X is the function m(t) = E(e^{tX}). The reason for such a name is that the Taylor expansion of m(t), obtained by expanding the exponential function into a series and using the linearity of expectation, is

\[ m(t) = E(e^{tX}) = E\Big(\sum_{k=0}^{\infty} \frac{t^k X^k}{k!}\Big) = \sum_{k=0}^{\infty} \frac{t^k E(X^k)}{k!}, \]

and E(X^k), k ≥ 1, are the moments of X.

Variance and Standard Deviation of a Random Variable

The variance of a random variable is defined as V(X) = E(X − E(X))², assuming that both expectations involved are finite; the standard deviation of a random variable X is given by σ = √V(X).

‡To make such notation fully meaningful one needs an extension of the notion of a function, called a generalized function (also called a distribution). Generalized functions allow us to take derivatives of step-like functions such as F_X(x) when X is a discrete random variable.

Properties of variance

\[
\begin{aligned}
V(X) &= E(X - E(X))^2 && (3.1)\\
&= E\big(X^2 - 2X\,E(X) + (E(X))^2\big) && (3.2)\\
&= E(X^2) - 2E(X)E(X) + (E(X))^2 && (3.3)\\
&= E(X^2) - (E(X))^2 && (3.4)
\end{aligned}
\]

• Let X_1, ..., X_n be independent random variables and c_1, ..., c_n real numbers; then

\[ V(c_1 X_1 + \cdots + c_n X_n) = c_1^2 V(X_1) + \cdots + c_n^2 V(X_n). \]
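A quick simulation sketch (Python/NumPy, illustrative; the distributions and constants are arbitrary choices) checking this identity for two independent random variables:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000

    # Two independent random variables with known variances, and arbitrary constants.
    X1 = rng.normal(0.0, 2.0, n)        # V(X1) = 4
    X2 = rng.exponential(3.0, n)        # V(X2) = 9
    c1, c2 = 5.0, -2.0

    lhs = np.var(c1 * X1 + c2 * X2)                 # empirical V(c1 X1 + c2 X2)
    rhs = c1**2 * np.var(X1) + c2**2 * np.var(X2)   # c1^2 V(X1) + c2^2 V(X2)
    print(lhs, rhs)   # both close to 25*4 + 4*9 = 136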

4 A few of the most common random variables

We first list a few common discrete random variables.

4.1 The Discrete Uniform Distribution

The Discrete Uniform Distribution on the set of samples of size n assigns probability 1/n to any of the n possible outcomes.

4.2 The Bernoulli Distribution

This is just the probability of getting a head when tossing a (biased) coin. Thus, the set of all outcomes is {0, 1} and P(X = 1) = p, P(X = 0) = 1 − p, for some 0 ≤ p ≤ 1.

Note: p = 0 does NOT mean that the event X = 1 is physically impossible; it only means that if you repeated the coin tossing experiment n times, then the number of heads you get divided by the total number of trials will converge to 0 as n approaches infinity.

4.3 The Binomial Distribution

Let the experiment be “tossing a (biased) coin n times”, and let the random variable X be “the number of heads obtained”. Then the values of X are 0, 1, ..., n, and

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}. \]

We denote such a random variable by Binomial(n, p).

Assume that we toss a coin n times and count the number of heads X_1, and then toss it m more times and again count the number of heads X_2. What is the probability distribution of the sum of these two numbers, i.e., what is the probability distribution of X_1 + X_2? Clearly, since the above experiment can also be seen as a single instance of the experiment consisting of tossing the coin n + m times and counting the number of heads, we have that X_1 + X_2 ∼ Binomial(n + m, p), i.e.,

\[ \text{Binomial}(n, p) + \text{Binomial}(m, p) \sim \text{Binomial}(n + m, p). \]

The expected value of the Binomial distribution can be found by considering it as the sum of n independent Bernoulli distributed random variables, X = Y_1 + ... + Y_n; then E(X) = nE(Y_1) = np. The variance is likewise V(X) = nV(Y_1) = nE((Y_1 − p)²) = n(E(Y_1²) − p²) = n(p − p²) = np(1 − p).
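The following small sketch (Python/NumPy, illustrative; the parameter values are arbitrary) checks both the additivity property and the mean and variance formulas by simulation:

    import numpy as np

    rng = np.random.default_rng(2)
    n, m, p, reps = 20, 30, 0.3, 500_000

    # Sum of independent Binomial(n, p) and Binomial(m, p) samples behaves
    # like Binomial(n + m, p): same mean and variance.
    s = rng.binomial(n, p, reps) + rng.binomial(m, p, reps)
    print(s.mean(), (n + m) * p)              # ~15.0 vs 15.0
    print(s.var(), (n + m) * p * (1 - p))     # ~10.5 vs 10.5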

4.4 The Poisson Trials

The Poisson Trials generalize the Binomial Distribution by allowing different probabilities at each trial. Thus, in this case the random variable is defined as X = ∑_{i=1}^n X_i, where the X_i are independent Bernoulli Trials, each with probability of success p_i. Clearly, if p_i = p for all i we get just the Binomial Distribution. Clearly, E(X) = ∑_{i=1}^n p_i.

4.5 The Geometric Distribution

The Geometric Distribution is just the distribution of the random variable corresponding to the number of tosses of a (biased) coin you need to perform to get a head. Clearly, its values are all positive integers, and P(X = k) = p(1 − p)^{k−1}.

Note that

\[ \sum_{k=1}^{\infty} p(1-p)^{k-1} = p \sum_{k=1}^{\infty} (1-p)^{k-1} = \frac{p}{1-(1-p)} = 1. \]
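A minimal illustrative sketch (Python) sampling the geometric distribution exactly as described, by tossing a biased coin until the first head; the standard fact that E(X) = 1/p, not derived in these notes, is used as the reference value:

    import random

    random.seed(3)
    p = 0.25

    def geometric(p):
        # Toss a p-biased coin until the first head; return the number of tosses.
        k = 1
        while random.random() >= p:   # tails, with probability 1 - p
            k += 1
        return k

    draws = [geometric(p) for _ in range(200_000)]
    print(sum(draws) / len(draws))    # close to 1/p = 4 (the mean of the geometric)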

4.6 The Poisson Distribution

A discrete random variable X has the Poisson Distribution with parameter λ if

\[ P(X = n) = e^{-\lambda}\,\frac{\lambda^n}{n!} \]

for all integers n ≥ 0. Note that the definition is correct because

\[ \sum_{n \ge 0} e^{-\lambda}\,\frac{\lambda^n}{n!} = e^{-\lambda} \sum_{n \ge 0} \frac{\lambda^n}{n!} = e^{-\lambda} e^{\lambda} = 1. \]
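A short numerical check (Python, illustrative) of this normalisation, which also accumulates the standard fact (not derived here) that E(X) = λ:

    import math

    lam = 4.0

    # Accumulate P(X = n) = e^{-lam} lam^n / n!, using the ratio
    # P(X = n+1) / P(X = n) = lam / (n + 1) to avoid huge factorials.
    term, total, mean = math.exp(-lam), 0.0, 0.0
    for n in range(100):
        total += term
        mean += n * term
        term *= lam / (n + 1)
    print(total, mean)   # ~1.0 and ~4.0: probabilities sum to 1, and E(X) = lambda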

We now list a few common continuous random variables.

4.7 The Uniform Distribution

The uniform distribution on an interval [a, b] is given by a constant density on [a, b], i.e., X has the following density function f(x):

\[ f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } a \le x \le b,\\ 0 & \text{otherwise.} \end{cases} \]


4.8 The Normal (Gaussian) Distribution

X has a normal distribution with mean µ and standard deviation σ, denoted by X ∼ N(µ, σ²), if

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \]

If X is a Binomial random variable corresponding to n Bernoulli trials with probability of success equal to p, then, if n is reasonably large and p is not too close to 0 or 1, X can be reasonably well approximated by a normal distribution with the same mean and variance, X ∼ N(np, np(1 − p)).
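The following illustrative Python sketch compares the Binomial(100, 0.3) probability mass function with the approximating normal density near the mean; the parameter values are arbitrary:

    import math

    n, p = 100, 0.3
    mu, var = n * p, n * p * (1 - p)   # 30 and 21

    def binom_pmf(k):
        # P(X = k) for X ~ Binomial(n, p)
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    def normal_pdf(x):
        # density of N(mu, var)
        return math.exp(-(x - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    # Around the mean the two agree quite well already for n = 100.
    for k in (20, 25, 30, 35, 40):
        print(k, round(binom_pmf(k), 4), round(normal_pdf(k), 4))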

4.9 The Exponential Distribution

X has an exponential distribution with parameter β, denoted by X ∼ Exp(β), if

\[ f(x) = \frac{1}{\beta}\, e^{-\frac{x}{\beta}}, \quad \text{for all } x \ge 0. \]

5 Some simple inequalities

\[ 1 + x \le e^x \quad \text{for all } x \in \mathbb{R}. \tag{5.1} \]

Let f(x) = e^x − x − 1; since f′(x) = e^x − 1, we have f′(x) = 0 if and only if x = 0. Since f′′(x) = e^x > 0, the function f(x) is convex and thus it attains its minimum at x = 0. Since f(0) = 0, we get that f(x) ≥ 0 for all x. Taking x = −1/n we get the first of the two inequalities below; prove, as an exercise, the second inequality.

\[ \Big(1 - \frac{1}{n}\Big)^{n} \le \frac{1}{e} \le \Big(1 - \frac{1}{n}\Big)^{n-1} \quad \text{for all } n \in \mathbb{N}. \tag{5.2} \]

6 Probability Inequalities

Roughly speaking, probability inequalities estimate the probability that a random variable X takes values which deviate from its mean µ = E(X).

6.1 The Markov Inequality

Let X ≥ 0 be a non-negative random variable. Then for all t > 0,

\[ P(X \ge t) \le \frac{E(X)}{t}. \]

Thus, taking t = λE(X), we get

\[ P(X \ge \lambda E(X)) \le \frac{1}{\lambda}. \]
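Before proving the inequality, here is a hedged empirical check (Python/NumPy, illustrative) on non-negative exponential samples:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.exponential(1.0, 1_000_000)   # non-negative samples with E(X) = 1

    for lam in (2, 5, 10):
        empirical = (X >= lam * X.mean()).mean()   # fraction with X >= lambda E(X)
        print(lam, empirical, "<=", 1 / lam)       # the Markov bound 1/lambda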


The Markov inequality follows from the observation that

\[
\begin{aligned}
E(X) = \int x\,dF_X(x) &= \int_{x<t} x\,dF_X(x) + \int_{x\ge t} x\,dF_X(x)\\
&\ge \int_{x\ge t} x\,dF_X(x) \ge \int_{x\ge t} t\,dF_X(x) = t \int_{x\ge t} dF_X(x) = t\,P(X \ge t).
\end{aligned}
\]

Note that if g(x) is a function which attains only non-negative values, then for every random variable X we have

\[ P(g(X) \ge t) \le \frac{E(g(X))}{t}, \]

and if P(g(X) > t) > 0, then the above argument shows that, in fact,

\[ P(g(X) > t) < \frac{E(g(X))}{t}, \tag{6.1} \]

since g(X) is strictly greater than t on the event {g(X) > t}.

One cannot prove a sharper inequality than the Markov Inequality without assuming something more than the non-negativity of X and the existence of its expected value. Assuming that the variance of X also exists, we can get the following additive bound (rather than multiplicative, as in the Markov inequality) for how far X is likely to be away from its expected value.

6.2 The Chebyshev Inequality

Let X be a random variable with expected value µ = E(X) and standard deviation σ = √(E((X − µ)²)). Then for all λ > 0,

\[ P(|X - \mu| \ge \lambda\sigma) \le \frac{1}{\lambda^2}. \]

The Chebyshev inequality follows from the Markov inequality applied to the random variable (X − µ)². Note that (X − µ)² ≥ λ²σ² iff |X − µ| ≥ λσ. Thus,

\[ P(|X - \mu| \ge \lambda\sigma) = P\big((X - \mu)^2 \ge \lambda^2\sigma^2\big) \le \frac{E((X - \mu)^2)}{\lambda^2\sigma^2} = \frac{1}{\lambda^2}. \]
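Again, an illustrative empirical check (Python/NumPy; the standard normal is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.standard_normal(1_000_000)   # mu = 0, sigma = 1

    for lam in (1.5, 2.0, 3.0):
        empirical = (np.abs(X - X.mean()) >= lam * X.std()).mean()
        print(lam, empirical, "<=", 1 / lam**2)   # the Chebyshev bound 1/lambda^2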

6.3 The Chernoff bound

The last inequality we will use is an example of a technique called the Chernoff bound; it can be seen as an application of the Markov inequality to the moment generating function m(t) = E(e^{tX}), where X is a Poisson Trial.

Theorem: Let X = ∑_{k=1}^n X_k, where the X_k, 1 ≤ k ≤ n, are independent Bernoulli trials with probabilities of success P(X_k = 1) = p_k, where 0 < p_k < 1. Thus, µ = E(X) = ∑_{k=1}^n p_k. Let δ > 0 be any positive real. Then the following Chernoff bound holds:

\[ P(X > (1+\delta)\mu) < \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}. \]


Proof: Clearly

\[ P(X > (1+\delta)\mu) = P\big(e^{tX} > e^{t(1+\delta)\mu}\big) \]

for any t > 0, and applying the Markov inequality, in its strict form (6.1), to the right-hand side we get

\[ P\big(e^{tX} > e^{t(1+\delta)\mu}\big) < \frac{E(e^{tX})}{e^{t(1+\delta)\mu}}. \]

We now use the fact that X = ∑_{k=1}^n X_k and obtain

\[ P(X > (1+\delta)\mu) < \frac{E\big(e^{t\sum_{k=1}^n X_k}\big)}{e^{t(1+\delta)\mu}} = \frac{E\big(\prod_{k=1}^n e^{tX_k}\big)}{e^{t(1+\delta)\mu}}. \]

Since the X_k are independent, so are the random variables e^{tX_k}; thus

\[ E\Big(\prod_{k=1}^{n} e^{tX_k}\Big) = \prod_{k=1}^{n} E\big(e^{tX_k}\big). \]

Clearly, the random variable e^{tX_k} takes the value e^{t·1} = e^t with probability p_k and the value e^{t·0} = e^0 = 1 with probability 1 − p_k. Thus,

\[ E\big(e^{tX_k}\big) = p_k e^t + 1 - p_k = 1 + p_k(e^t - 1). \]

Consequently,

\[ P(X > (1+\delta)\mu) < \frac{\prod_{k=1}^{n} \big(1 + p_k(e^t - 1)\big)}{e^{t(1+\delta)\mu}}. \]

We now apply the inequality (5.1) with x = p_k(e^t − 1) and obtain

\[ P(X > (1+\delta)\mu) < \frac{\prod_{k=1}^{n} e^{p_k(e^t-1)}}{e^{t(1+\delta)\mu}} = \frac{e^{(e^t-1)\sum_{k=1}^{n} p_k}}{e^{t(1+\delta)\mu}} = \frac{e^{\mu(e^t-1)}}{e^{t(1+\delta)\mu}} = e^{\mu(e^t-1) - t(1+\delta)\mu}. \]

Note that the above derivation holds for every positive t; we now choose the t which yields the tightest bound, which happens when the right-hand side achieves its minimum. Thus, we look for the stationary points of the function

\[ h(t) = e^{\mu(e^t-1) - t(1+\delta)\mu}. \]

Since

\[ h'(t) = e^{\mu(e^t-1) - t(1+\delta)\mu}\,\big(\mu e^t - (1+\delta)\mu\big), \]

we have

\[ h'(t) = 0 \iff \mu e^t - (1+\delta)\mu = 0, \]

i.e.,

\[ t = \ln(1+\delta). \]

It is easy to verify that h′′(t) = e^{µ(e^t − 1 − (1+δ)t)} (µ²(1 + δ − e^t)² + µe^t) > 0 for all t, which implies that h(t) achieves its minimum at t = ln(1 + δ). Thus, we conclude that

\[ P(X > (1+\delta)\mu) < e^{\mu(e^{\ln(1+\delta)}-1) - (1+\delta)\mu\ln(1+\delta)} = e^{\mu((1+\delta)-1) - (1+\delta)\mu\ln(1+\delta)} = \frac{e^{\mu\delta}}{e^{\ln(1+\delta)\,(1+\delta)\mu}} = \frac{e^{\mu\delta}}{(1+\delta)^{(1+\delta)\mu}} = \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}. \]
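To see how the bound behaves in a concrete case, here is an illustrative Python/NumPy sketch for equal p_k = p, so that X is Binomial(n, p); the parameter values are arbitrary:

    import numpy as np

    rng = np.random.default_rng(6)
    n, p, delta = 100, 0.1, 1.0
    mu = n * p                                   # = 10, since all p_k equal p here

    X = rng.binomial(n, p, 2_000_000)
    empirical = (X > (1 + delta) * mu).mean()    # P(X > (1 + delta) mu)
    bound = (np.e**delta / (1 + delta)**(1 + delta))**mu
    print(empirical, "<", bound)                 # the empirical tail is well below the bound (~0.02)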
