TRANSCRIPT
Chapter 2
Review of Probability
The probability framework for statistical inference
a) Random variables and probability distributions
b) Single rv: expected value, mean, variance, standard deviation
c) Two rvs: joint vs. marginal vs. conditional distributions; independence, covariance, sums of rvs
d) Key distributions: Normal, Chi-squared, Student t, F
e) Random sampling & sample average’s distribution
f) Large sample approximations
Random variables & probability distrib’ns
• Random variables (rvs): commute time, #computer crashes
• Rvs can be continuous (time) or discrete (#crashes)
• Outcomes: Mutually exclusive values that a rv can take
• Eg: no crash, crash once, crash twice, …; numerically: 0,1,2,…
• Sample space: set of all outcomes, e.g. {0,1,2,…}
• Event: By definition a collection of outcomes.
• E.g. “crash no more than once” = {0,1} = {no crash, crash once}
• Probability of an outcome/event: Proportion of time it occurs in the long run (after many independent, identical experiments)
Probability distrib’n – discrete rv
• Probability distribution of a rv: The list, across outcomes, of the probability of the outcome
• The probabilities in the list add up to 1
• Example: M = #computer crashes while you write paper
• Assume: If four crashes occur, write paper by hand (M<5)
• Event = {0,1} has probability Pr(M=0) + Pr(M=1) = .9
• Cumulative distribution function cdf: Prob. rv is at most given value, e.g. cdf(1) = .9
Outcome     0     1     2     3     4
Pr dist    .80   .10   .06   .03   .01
Cum dist   .80   .90   .96   .99  1.00
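As an aside (not in the slides), a minimal Python sketch, assuming numpy is available, showing that the cumulative distribution in the table is just a running sum of the probabilities:

```python
import numpy as np

# Probability distribution of M (# computer crashes) from the table above
outcomes = np.array([0, 1, 2, 3, 4])
probs = np.array([.80, .10, .06, .03, .01])   # must sum to 1

cdf = np.cumsum(probs)                        # cumulative distribution
print(dict(zip(outcomes.tolist(), cdf.round(2).tolist())))
print("Pr(M <= 1) =", round(cdf[1], 2))       # cdf(1) ≈ 0.9, as on the slide
```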
Bernoulli distribution – discrete rv
• If there are only TWO outcomes, rv called Bernoulli
• E.g.: Let G be gender of next person you meet
• Outcomes are “male”, “female”
• If probability of one outcome is p, the other’s must be 1 – p (so probabilities add up to 1)
Probability distrib’n – continuous rv
• Cumulative probability distribution cdf(x): Probability rv is at most a given value, x
• p. 19, figure 2.2
• Probability density function pdf(x): Intuitively, it is the probability of outcome x … except with a continuous rv, usually this is zero for every x.
• More accurately: the pdf is the function with the property that, for x<y, cdf(y) – cdf(x) = area under the pdf between x and y
• E.g. Probability(commute 15’ - 20’ long) = .78 - .20 = .58
Expected values, Mean, Variance
• Expected value E(Y) of a random variable Y: the long run average value of the rv (after many independent, repeated occurrences)
• Its value is denoted µY … Also called expectation of Y or mean of Y
• Computed as average of outcomes, each weighted by its probability
• E.g.: E(M) = 0×.8 + 1×.1 + 2×.06 + 3×.03 + 4×.01 = .35 … the mean number of crashes is .35
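A one-line check of this weighted average (illustrative sketch, assuming numpy):

```python
import numpy as np

outcomes = np.array([0, 1, 2, 3, 4])
probs = np.array([.80, .10, .06, .03, .01])

E_M = float(np.sum(outcomes * probs))   # average of outcomes weighted by probability
print(E_M)                              # ≈ 0.35 crashes
```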
Expected values of Bernoulli rv
• Say Bernoulli G has probability distribution Pr(G=1)=p, Pr(G=0)=1-p
• Then E(G) = 1xp + 0x(1-p) = p
• That is, its mean is the probability of outcome 1 (whatever it signifies)
Formulas for expected value
• Discrete rv: If Y can take k outcome values y1, …, yk with probabilities p1, …, pk respectively, then:
• E(Y) = y1·p1 + … + yk·pk = ∑i yi·pi
• Continuous rv: If Y has a pdf, with values ranging from L to H, then:
• E(Y) = ∫[L,H] y·pdf(y) dy … (just fyi)
• Note: If Y, Z are rvs, then E(Y+Z) = E(Y) +E(Z)
• Note: If c is a constant, then E(cY) = cE(Y)
Standard deviation and variance
• These measure the spread of a rv
• Variance of Y, var(Y) := E[(Y − µY)²], the expected squared deviation from its mean. Also denoted σ²Y
• Formula: var(Y) = ∑i (yi − µY)² pi
• Expanding the square: var(Y) = E(Y²) − µY²
• Note: If c is a constant, var(cY) = c²var(Y), var(c+Y) = var(Y)
• Because it involves a square, the variance is not in the same units as Y
• Standard deviation of Y σY:= square root of var(Y)
• Var(M) = .6475, stdev(M) ~ .8
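Continuing the illustrative sketch for M (assuming numpy), the variance and standard deviation follow directly from the formula above:

```python
import numpy as np

outcomes = np.array([0, 1, 2, 3, 4])
probs = np.array([.80, .10, .06, .03, .01])

mu_M = np.sum(outcomes * probs)                 # E(M) ≈ 0.35
var_M = np.sum((outcomes - mu_M) ** 2 * probs)  # E[(M - mu)^2], weighted by probability
print(var_M, np.sqrt(var_M))                    # ≈ 0.6475 and ≈ 0.80
```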
Variance of Bernoulli rv
• Say Bernoulli G has probability distribution Pr(G=1)=p, Pr(G=0)=1-p
• Recall E(G) = p
• So var(G) = (0 − p)²(1 − p) + (1 − p)²p = p(1 − p)
Mean & Variance of linear function
• Say X is a rv, and Y a linear function of it: Y = a + bX
• Then Y is a rv in its own right
• Its mean E(Y) = E(a + bX) = E(a) + E(bX)
= a + bE(X) … in short, µY = a+bµX
• Recall, if c is a constant: var(cX) = c²var(X) and var(c+X) = var(X), so …
• Var(Y) = var(a+bX) = var(bX) = b²var(X)
• σY = |b|σX upon taking square roots on both sides
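A small simulation sketch (distribution and values of a, b chosen arbitrarily for illustration; assumes numpy) confirming these linear-transformation rules:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # any rv would do
a, b = 5.0, -2.0
Y = a + b * X                                        # linear function of X

print(Y.mean(), a + b * X.mean())   # mu_Y = a + b*mu_X (up to sampling noise)
print(Y.var(), b**2 * X.var())      # var(Y) = b^2 var(X)
print(Y.std(), abs(b) * X.std())    # sigma_Y = |b| sigma_X
```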
Measures of symmetry and tails
Skewness(Y) = E[(Y − µY)³] / σY³ = measure of asymmetry of a distribution
• skewness = 0: distribution is symmetric
• skewness > (<) 0: distribution has fatter right (left) tail
Kurtosis(Y) = E[(Y − µY)⁴] / σY⁴ = measure of mass in tails = measure of probability of large values
• kurtosis = 3: normal distribution
• kurtosis > 3: heavy tails (“leptokurtotic”)
• Skew.(cY) = Skew.(Y), Kurt.(cY) = Kurt.(Y) for c > 0 … “scale-invariant”
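As an illustrative check (assuming scipy is available): the standard normal indeed has skewness 0 and kurtosis 3; note scipy reports excess kurtosis (kurtosis minus 3).

```python
from scipy.stats import norm

skew, excess_kurt = norm.stats(moments='sk')
print(skew)             # 0.0 -> the normal is symmetric
print(excess_kurt + 3)  # 3.0 -> kurtosis of the normal
```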
Two random variables: joint distributions and covariance
• The joint distribution of two random variables (say X and Y) is the probability/pdf that (X,Y) takes particular values (x,y), jointly.
• Say X = 0 (raining), 1 (not) & Y = 0 (long commute), 1 (not)
• Four outcomes for (X,Y): (0,0), (0,1), (1,0), (1,1)
Y↓ \ X→        Rain (X=0)   Dry (X=1)   Total
Long (Y=0)        .15          .07        .22
Short (Y=1)       .15          .63        .78
Total             .30          .70       1.00
Two rvs: marginal dist’bn
• The marginal probability distribution of rv Y is its probability distribution, as X is free to take any value
• That is, Pr(Y=y) := ∑i Pr(X=xi, Y=y)
• E.g. Pr(long commute) = .22 & Pr(rain) = .3
• Useful to compute expectations, variances, etc of Y:
• E(Y) = ∑i yi· pi = 0·Pr(Y=0) + 1·Pr(Y=1) = Pr(Y=1) = .78
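A small sketch (mirroring the hypothetical rain/commute table above; assumes numpy) showing the marginals as row and column sums of the joint distribution:

```python
import numpy as np

# rows: Y = 0 (long), 1 (short); columns: X = 0 (rain), 1 (dry)
joint = np.array([[.15, .07],
                  [.15, .63]])

pY = joint.sum(axis=1)   # marginal of Y: [.22, .78]
pX = joint.sum(axis=0)   # marginal of X: [.30, .70]
E_Y = np.array([0, 1]) @ pY
print(pY, pX, E_Y)       # E(Y) ≈ 0.78
```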
Two rvs: conditional dist’bn
• The distribution of rv Y conditional on rv X taking a specific value is called the conditional distribution of Y given X .
• Written Pr(Y=y|X=x)
• E.g. Pr(Y=0|X=0) = .5 (equally likely)
• Bayes’ formula: Pr(Y=y|X=x) = Pr(Y=y,X=x)/Pr(X=x)
• Indeed, Pr(Y=0,X=0)/Pr(X=0) = .15/.30 = .5
Note, the denominator uses the marginal dist’bn of X
Two rvs: conditional expectation
• The conditional expectation/mean of Y given X , E(Y|X=x), is the mean of Pr(Y|X=x)
• Discrete: E(Y|X=x):= ∑i Pr(Y=yi|X=x)·yi
• E(Y|X=1) = Pr(Y=0|X=1)·0+Pr(Y=1|X=1)·1 = .63/.7 = .9
• E(Y|X=0) = Pr(Y=0|X=0)·0+Pr(Y=1|X=0)·1 = .5
• Law of iterated expectations: The mean of Y is the weighted average of E(Y|X=xi), with weights given by the probability distribution of X = x1, …, xk.
• That is, E(Y) = ∑i E(Y|X=xi)·Pr(X=xi)
• Compactly, E(Y) = E[E(Y|X)] E.g. E(Y) = .9·.7 + .5·.3 = .78
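A minimal sketch of these two computations, using the same hypothetical joint table (assumes numpy):

```python
import numpy as np

joint = np.array([[.15, .07],   # Pr(Y=0, X=0), Pr(Y=0, X=1)
                  [.15, .63]])  # Pr(Y=1, X=0), Pr(Y=1, X=1)
y_vals = np.array([0, 1])
pX = joint.sum(axis=0)          # marginal of X: [.30, .70]

cond = joint / pX               # column x holds Pr(Y=y | X=x)
E_Y_given_X = y_vals @ cond     # [E(Y|X=0), E(Y|X=1)] = [.5, .9]
E_Y = E_Y_given_X @ pX          # law of iterated expectations
print(E_Y_given_X, E_Y)         # ≈ [0.5, 0.9] and 0.78
```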
Two rvs: conditional variance
• The conditional variance of Y given X, var(Y|X=x), is the variance of Pr(Y|X=x): E[Y − E(Y|X=x)]²
• E(Y|X=x) above is constant & underlying prob. is Pr(Y|X=x)
• Discrete: var(Y|X=x) := ∑i Pr(Y=yi|X=x)·[yi − E(Y|X=x)]²
• var(Y|X=1) = Pr(Y=0|X=1)·[0 − E(Y|X=1)]² + Pr(Y=1|X=1)·[1 − E(Y|X=1)]² = .1·[0 − .9]² + .9·[1 − .9]² = .081 + .009 = .09
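And the same style of sketch for var(Y|X=1), using the conditional probabilities from the table:

```python
import numpy as np

p_cond = np.array([.1, .9])            # Pr(Y=0|X=1), Pr(Y=1|X=1)
y_vals = np.array([0, 1])

m = y_vals @ p_cond                    # E(Y|X=1) = 0.9
var_cond = p_cond @ (y_vals - m) ** 2  # probability-weighted squared deviations
print(var_cond)                        # ≈ 0.09, as on the slide
```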
Two rvs: independence
• Informally, rvs X, Y are independent if knowing the value of one yields no information about the value of the other.
• Precisely, they are independent if the conditional distribution of Y given X equals the marginal distribution of Y.
• That is, if Pr(Y=y|X=x) = Pr(Y=y) for all possible x
• Using Bayes’ formula: Pr(Y=y,X=x) = Pr(Y=y)·Pr(X=x)
Aside: Standardizing a rv
• A common transformation of a rv is standardizing it:
• X into X* := (X − µX)/σX
• Deviations from mean, divided by standard deviation
• E(X*) = 0 and var(X*) = 1.
• Thus standardized rvs always have mean 0 and st.dev 1
• Exercise: If c>0 is a constant, then (cX)* = X*, i.e. standardizing cX gives back X*
• Thus we say this transformation is scale-invariant. If X is measuring time, whether in seconds, minutes or hours, the transformation gives the same result.
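A quick simulation sketch (arbitrary distribution, purely for illustration; assumes numpy) of standardizing a rv and of its scale-invariance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=7.0, size=1_000_000)        # e.g. commute time in minutes

X_std = (X - X.mean()) / X.std()                      # standardized version of X
print(X_std.mean().round(4), X_std.var().round(4))    # ≈ 0 and ≈ 1

# scale-invariance: measuring X in hours instead gives the same standardized rv
X_hours = X / 60
print(np.allclose(X_std, (X_hours - X_hours.mean()) / X_hours.std()))   # True
```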
Two rvs: covariance
• A measure of how much two rvs X, Y vary together is this:
• Covariance between X and Y, cov(X,Y) := E[(X − µX)(Y − µY)]
• It is also denoted σXY
• Expanding, cov(X,Y) = E(XY) − µXµY
• Note, (X-µx) & (Y-µY) are deviations from their means.
• Suppose when X tends to exceed its mean, so does Y tend to exceed its mean. Then the product is positive, and so is the covariance. Likewise, a negative covariance suggests that when X overperforms, Y underperforms, relative to means.
• Discrete: cov(X,Y) = ∑i ∑j (xi − µX)(yj − µY)·Pr(X=xi, Y=yj)
• Exercise: If X, Y are independent, cov = 0 (converse false)
• Exercise: cov(aX,Y)=acov(X,Y). Also, cov(a+X,Y)=cov(X,Y)
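An illustrative computation of cov(X,Y) from the hypothetical rain/commute table (assumes numpy):

```python
import numpy as np

joint = np.array([[.15, .07],   # rows: Y = 0, 1; columns: X = 0, 1
                  [.15, .63]])
x_vals = np.array([0, 1])
y_vals = np.array([0, 1])

mu_X = x_vals @ joint.sum(axis=0)   # 0.70
mu_Y = y_vals @ joint.sum(axis=1)   # 0.78

cov_XY = sum(joint[i, j] * (y_vals[i] - mu_Y) * (x_vals[j] - mu_X)
             for i in range(2) for j in range(2))
print(cov_XY)                       # ≈ 0.084 > 0: rain and long commutes go together
```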
Two rvs: correlation
• Covariance E[(X − µX)(Y − µY)] involves variables in potentially different scales (e.g. X in minutes, Y in hours), so the size of the product is hard to interpret.
• However, recall that X* = (X − µX)/σX and Y* = (Y − µY)/σY are scale-invariant, so E(X*·Y*) makes more sense:
• corr(X,Y) := E(X*·Y*) = … (bottom of last slide) … = cov(X,Y)/(σXσY)
• Rvs are said to be uncorrelated if cov(X,Y) = 0. Then corr=0.
• Exercise: If E(Y|X) is independent of X (equal to µY), then X,Y are uncorrelated.
• Fact: Corr is always between -1 and +1
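Continuing the covariance sketch above, dividing by the standard deviations gives the correlation (illustrative; assumes numpy):

```python
import numpy as np

cov_XY = 0.084
sd_X = np.sqrt(.7 * (1 - .7))     # X is Bernoulli with p = .70
sd_Y = np.sqrt(.78 * (1 - .78))   # Y is Bernoulli with p = .78

corr_XY = cov_XY / (sd_X * sd_Y)
print(corr_XY)                    # ≈ 0.44, inside [-1, +1] as it must be
```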
Correlation measures linear association
Mean and variance of sums of rvs
Some basic consequences of the definitions of E and var:
• E(X+Y) = E(X) + E(Y)
• E(a + bX) = a + bE(X)
• Var(a+bX) = b²var(X)
• Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)
… so Var(X+Y) = Var(X) + Var(Y) if and only if X and Y are uncorrelated
• Var(X) = E(X²) − µX²
• Cov(a + bX,Y) = bCov(X,Y)
• Cov(X,Y) = E(XY) − µXµY
• Cov(X,X) = Var(X)
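A simulation sketch of the variance-of-a-sum identity (arbitrary rvs, illustration only; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=1_000_000)
Y = 0.5 * X + rng.normal(size=1_000_000)   # built to be correlated with X

lhs = np.var(X + Y)
rhs = np.var(X) + np.var(Y) + 2 * np.cov(X, Y)[0, 1]
print(lhs, rhs)                            # the two sides agree up to sampling noise
```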
Key distributions: Normal
• The normal distribution with mean µ and variance σ²>0, denoted N(µ,σ²), is defined by the pdf
f_Y(y) = (1 / √(2πσ²)) · exp( −(y − µ)² / (2σ²) )
• The factor preceding exp ensures probabilities sum to 1, ∫f(y)dy=1
• One can show that E(Y)=µ, var(Y)=σ², skew. = 0, kurt. = 3
• Standard normal dist’n: Z ~ N(0,1), i.e. zero mean & unit variance
• Its cdf is denoted Φ, so Pr(Z≤c) = Φ(c)
• Table of values of Φ in p.749-750.
• Table is relevant also for any normal N(µ,σ²): standardize it …
Key distributions: Normal
• Say Y is normal; set Z := (Y − µ)/σ, so Y = µ + σZ.
• One can show that Z is N(0,1), so that Φ is relevant.
• For example, to look up Pr(Y≤D), note
Pr(Y ≤ D) = Pr( (Y − µ)/σ ≤ (D − µ)/σ ) = Pr( Z ≤ (D − µ)/σ ) = Φ( (D − µ)/σ )
• Which again you can look up on p.750, given D,µ,σ
• Since this is a cdf, Pr(Z>K) = 1-Pr(Z≤K) = 1-Φ(K)
• Also, to look up Pr(A<Z≤B), note
Pr(A < Z ≤ B) = 1 − Pr(Z ≤ A) − Pr(Z > B) = 1 − Φ(A) − [1 − Φ(B)] = Φ(B) − Φ(A)
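An illustrative check of these standardization steps (assumes scipy; the values of µ, σ, D, A, B below are made up):

```python
from scipy.stats import norm

mu, sigma, D = 2.0, 3.0, 5.0

# Pr(Y <= D) by standardizing, and directly via the normal cdf
print(norm.cdf((D - mu) / sigma))          # Phi((D - mu)/sigma) ≈ 0.841
print(norm.cdf(D, loc=mu, scale=sigma))    # same number, scipy standardizes for us

A, B = -1.0, 1.0
print(norm.cdf(B) - norm.cdf(A))           # Pr(A < Z <= B) = Phi(B) - Phi(A) ≈ 0.683
```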
Key distributions: normal
• An important feature is that normal dist’bns are closed under sums and scalings. That is, if X,Y are normals, and a,b are constants, then aX+bY also is normal.
• The mean of aX+bY we already know, from our work on expectations: its mean is aµX + bµY
• Its variance we also know, from before:
• Fact: If two normals are uncorrelated, they are independent
• Recall that the converse holds for any rvs: if independent, then uncorrelated
var(aX + bY) = a²·var(X) + b²·var(Y) + 2ab·cov(X,Y)
Key distributions: normal
• Fact: If a set of rvs has a multivariate normal disb’n, then the marginal dist’bn of each is normal
• Fact: If X,Y have a bivariate normal dist’bn, then E(Y|X=x) is linear in x, i.e. E(Y|X=x) = a + bx for all x, and some constants a,b.
Key distributions: Chi-squared
• The chi-squared distribution with m degrees of freedom is the dist’bn of the sum of squares of m independent standard normal rvs. Denoted χ²m
• So if X,Y are independent standard normals, then X² + Y² is a chi-squared with df=2.
• Table on p.752 gives some values, given the percentile.
• We see the 95th percentile for a χ²2 is 5.99
• The chi-squared will appear when we do tests. If we wish to test that a certain error term is statistically insignificant, and know that it has such a dist’bn, then the table will help us.
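The quoted percentile can be checked with scipy (illustrative sketch, assuming scipy is installed):

```python
from scipy.stats import chi2

print(chi2.ppf(0.95, df=2))   # ≈ 5.99, the 95th percentile of a chi-squared with df = 2
```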
Key distributions: Student t
• The Student t distribution with m degrees of freedom is the dist’bn of the ratio Z/√(χ²m/m), the ratio of a standard normal over the square root of a chi-squared with df=m divided by m, where these are independent.
• It has the same shape as a normal, except for fatter tails, which thin out the larger is m.
• A table with percentiles for the t dist’bn is on p. 751
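An illustrative look at the fatter tails (assumes scipy): the 97.5th percentile of the t shrinks toward the normal’s ≈ 1.96 as the degrees of freedom grow.

```python
from scipy.stats import t, norm

for m in (5, 30, 1000):
    print(m, t.ppf(0.975, df=m))   # larger cutoffs for small df -> fatter tails
print("normal:", norm.ppf(0.975))  # ≈ 1.96, the limit as df grows
```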
Key distributions: F
• The F distribution with m,n degrees of freedom Fm,n is the dist’bn of the ratio (W/m)/(V/n) where W,V are independent chi-squared dist’bns with df=m,n respectively.
• Recall the Student t ratio Z/√(χ²m/m): a standard normal over the square root of a chi-squared with df=m divided by m.
• A related dist’bn is Fm,∞ = W/m, where W is a χ²m
• When n is large, this is a good approximation.
• Tables on pp.753-6 give values of these F’s at various percentiles and given various df’s.
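A quick illustration (assumes scipy) of the large-n approximation Fm,n ≈ χ²m/m:

```python
from scipy.stats import f, chi2

m = 3
print(f.ppf(0.95, dfn=m, dfd=1000))   # 95th percentile of F with df = (3, 1000), ≈ 2.61
print(chi2.ppf(0.95, df=m) / m)       # 95th percentile of (chi-squared_3)/3, ≈ 2.60
```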