
Dr. Héctor Allende Review of Probability and Statistics

1

A Review of Probability and Statistics

• Descriptive statistics

• Probability

• Random variables

• Sampling distributions

• Estimation and confidence intervals

• Test of Hypothesis

–For mean, variances, and proportions

–Goodness of fit


2

Key Concepts

• Population -- "parameters"

–Finite

–Infinite

• Sample -- "statistics"

• Random samples - Your MOST important decision!


3

Data

• Deterministic vs. Probabilistic (Stochastic)

• Discrete or Continuous:

– Whether a variable is continuous (measured) or discrete (counted) is a property of the data, not of the measuring device: weight is a continuous variable, even if your scale can only measure values to the pound.

• Data description:

– Category frequency

– Category relative frequency


4

Data Types

• Qualitative (Categorical)

–Nominal -- IE = 1 ; EE = 2 ; CE = 3

–Ordinal -- poor = 1 ; fair = 2 ; good = 3 ; excellent = 4

• Quantitative (Numerical)

–Interval -- temperature, viscosity

–Ratio -- weight, height

• The type of statistics you can calculate depends on the data type. Average, median, and variance make no sense if the data is categorical (proportions do).


5

Data Presentation for Qualitative Data

• Rules:

– Each observation MUST fall in one and only one category.

– All observations must be accounted for.

• Table -- Provides greater detail

• Bar graphs -- Consider Pareto presentation!

• Pie charts (do not need to be round)


6

Data Presentation for Quantitative Data

• Consider a Stem-and-Leaf Display

• Use 5 to 20 classes (intervals, groups).

–Cell width, boundaries, limits, and midpoint

• Histograms

–Discrete

–Continuous (frequency polygon - plot at class mark)

• Cumulative frequency distribution (Ogive - plot at upper boundary)


7

Statistics

• Measures of Central Tendency

– Arithmetic Mean

– Median

– Mode

– Weighted mean

• Measures of Variation

– Range

– Variance

– Standard Deviation

• Coefficient of Variation

• The Empirical Rule


8

Arithmetic Mean and Variance -- Raw Data

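• Mean

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

• Variance

$$S^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}{n(n - 1)}$$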
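As a minimal sketch of this slide's raw-data formulas, the Python below computes the sample mean and then the sample variance in two ways: from the definition and from the computational shortcut. The function names and the data values are hypothetical, chosen only for illustration.

```python
def sample_mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)

def sample_variance_definition(values):
    """S^2 from the definition: sum of squared deviations over (n - 1)."""
    n = len(values)
    ybar = sample_mean(values)
    return sum((y - ybar) ** 2 for y in values) / (n - 1)

def sample_variance_shortcut(values):
    """S^2 from the shortcut: [n * sum(y^2) - (sum y)^2] / [n (n - 1)]."""
    n = len(values)
    total = sum(values)
    total_sq = sum(y * y for y in values)
    return (n * total_sq - total ** 2) / (n * (n - 1))

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # hypothetical observations
print(sample_mean(data))
print(sample_variance_definition(data), sample_variance_shortcut(data))
```

Both variance functions should agree up to floating-point rounding. The shortcut form is convenient by hand but can lose precision when the mean is large relative to the spread.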


9

Arithmetic Mean and Variance -- Grouped Data

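• Mean

$$\bar{y} = \frac{\sum_{i} f_i y_i}{n}$$

• Variance

$$S^2 = \frac{\sum_{i} f_i (y_i - \bar{y})^2}{n - 1} = \frac{n \sum_{i} f_i y_i^2 - \left(\sum_{i} f_i y_i\right)^2}{n(n - 1)}$$

where $n = \sum_{i} f_i$ and $y_i$ = class midpoint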


10

Percentiles and Box-Plots

• 100pth percentile: value such that 100p% of the area under the relative frequency distribution lies below it.

– Q1: lower quartile (25th percentile)

– Q3: upper quartile (75th percentile)

• Box-Plots: the box is bounded by the lower and upper quartiles; IQR = Q3 - Q1

– Whiskers mark the lowest and highest values within 1.5*IQR of Q1 or Q3

– Outliers: beyond 1.5*IQR from Q1 or Q3 (mark with *)

– z-scores: deviation from the mean in units of standard deviation. Outlier: absolute value of z-score > 3


11

Probability: Basic Concepts

• Experiment: A process of OBSERVATION

• Simple event - An OUTCOME of an experiment that cannot be decomposed

– “Mutually exclusive”

– “Equally likely”

• Sample Space - The set of all possible outcomes

• Event “A” - The set of all possible simple events that result in the outcome “A”


12

Probability

• A measure of uncertainty of an estimate

– The reliability of an inference

• Theoretical approach - “A Priori”

– Pr (Ai) = n/N

• n = number of possible ways “Ai” can be observed

• N = total number of possible outcomes

• Historical (empirical) approach - “A Posteriori”

– Pr (Ai) = n/N

• n = number of times “Ai” was observed

• N = total number of observations

• Subjective approach – An “Expert Opinion”


13

Probability Rules

• For the simple events $A_i$: $0 \le \Pr(A_i) \le 1$ and $\sum_{i} \Pr(A_i) = 1$

• Multiplication Rule:

– Number of ways to draw one element from set 1, which contains n1 elements, then an element from set 2, ..., and finally an element from set k (ORDER IS IMPORTANT!):

$$n_1 \cdot n_2 \cdots n_k$$


14

Permutations and Combinations

• Permutations:

– Number of ways to draw r out of n elements WHEN ORDER IS IMPORTANT:

$$P_r^n = \frac{n!}{(n - r)!}$$

• Combinations:

– Number of ways to select r out of n items when order is NOT important:

$$C_r^n = \frac{n!}{r!\,(n - r)!}$$
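The two counting formulas on this slide can be checked numerically. The sketch below evaluates them from factorials and compares the results with Python's built-in `math.perm` and `math.comb` (available from Python 3.8). The helper names and the values n = 5, r = 2 are illustrative assumptions.

```python
import math

def permutations_count(n, r):
    """Ordered draws of r out of n: n! / (n - r)!."""
    return math.factorial(n) // math.factorial(n - r)

def combinations_count(n, r):
    """Unordered selections of r out of n: n! / (r! (n - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

n, r = 5, 2  # hypothetical values
print(permutations_count(n, r), math.perm(n, r))
print(combinations_count(n, r), math.comb(n, r))
```

Each combination of r items corresponds to r! permutations, which is why the two counts differ by exactly that factor.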


15

Compound Events

Union

$$(A \cup B) = \{x : x \in A \text{ or } B \text{ or both}\}$$

Intersection

$$(A \cap B) = \{x : x \in A \text{ and } B\}$$

Complement

$$(A') = \{x : x \notin A\}$$


16

Conditional Probability

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{provided } P(B) \ne 0$$

Multiplicative Rule:

$$P(A \cap B) = P(A \mid B)\,P(B) \quad \text{provided } P(B) \ne 0$$


17

Other Probability Rules

• General addition rule:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

• Mutually Exclusive Events: $A \cap B = \{\,\}$, so $P(A \cap B) = 0$

• Independence:

– A and B are said to be statistically INDEPENDENT if and only if:

$$P(A \cap B) = P(A)\,P(B)$$


18

Bayes’ Rule

$$P(A_i \mid E) = \frac{P(A_i)\,P(E \mid A_i)}{\sum_{j} P(A_j)\,P(E \mid A_j)}$$
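Bayes' rule is easiest to follow with numbers. The sketch below assumes a made-up setting (three machines with invented production shares and defect rates) and computes the posterior probability that a defective item came from each machine. The function name and all figures are hypothetical.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(A_i | E) for each i, given P(A_i) and P(E | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(E) = sum over j of P(A_j) P(E | A_j)
    if evidence == 0:
        raise ValueError("P(E) is zero; the posterior is undefined")
    return [j / evidence for j in joint]

# Hypothetical example: machines produce 50%, 30%, 20% of the output
# with defect rates 1%, 2%, 3%. E = "the item is defective".
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.02, 0.03]
posterior = bayes_posterior(priors, likelihoods)
print(posterior)
```

The posteriors always sum to 1 because the denominator is the total probability of E over all the mutually exclusive causes.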


19

Random Variables

• Random variable: A function that maps every possible outcome of an experiment into a numerical value.

• Discrete random variable: The function can assume a finite (or countably infinite) number of values

• Continuous random variable: The function can assume any value between two limits.


20

Probability Distribution for a Discrete Random Variable

• Function that assigns a probability p(y) to each possible value of the random variable y.

$$0 \le p(y) \le 1$$

$$\sum_{y} p(y) = 1$$


21

Poisson Process

• Events occur over time (or in a given area, volume, weight, distance, ...)

• Probability of observing an event in a given unit of time is constant

• Able to define a unit of time small enough so that we can’t observe two or more events simultaneously.

• Tables usually give CUMULATIVE values!


22

The Poisson Distribution

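$$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$

x is the number of events observed over T

$\lambda$ is the expected number of events over T

e is the base of natural logs (2.71828)

$\mu = \sigma^2 = \lambda$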


23

Poisson Approximation to the Binomial

• In a binomial situation where n is very large (n > 25) and p is very small (p < 0.30, and np < 15), we can approximate b(y; n, p) by a Poisson distribution with mean $\lambda = np$:

$$b(y; n, p) = \binom{n}{y} p^y (1 - p)^{n - y} \approx P(y; \lambda = np) = \frac{e^{-np} (np)^y}{y!}$$
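To see how close the approximation is, the sketch below tabulates exact binomial probabilities next to their Poisson counterparts. It also includes a cumulative Poisson function, since tables usually give cumulative values. The values n = 100 and p = 0.02 are hypothetical, as are the function names.

```python
import math

def binomial_pmf(y, n, p):
    """Exact binomial probability b(y; n, p)."""
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

def poisson_pmf(y, lam):
    """Poisson probability of y events when the mean is lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def poisson_cdf(y, lam):
    """Cumulative Poisson probability P(Y <= y)."""
    return sum(poisson_pmf(k, lam) for k in range(y + 1))

n, p = 100, 0.02      # hypothetical values: large n, small p, np = 2
lam = n * p
for y in range(5):
    print(y, round(binomial_pmf(y, n, p), 4), round(poisson_pmf(y, lam), 4))
```

For these values the two columns differ only in the third decimal place, which is the behavior the rule of thumb on this slide is meant to capture.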


24

Probability Distribution for a Continuous Random Variable

• $F(y_0)$ is a cumulative distribution function that assigns a value to the probability of observing a value less than or equal to $y_0$:

$$F(y_0) = P(y \le y_0) = \int_{-\infty}^{y_0} f(y)\,dy$$

Property: F ( y ) is continuous over y


25

Probability Calculations

$$P(a \le y \le b) = \int_{a}^{b} f(y)\,dy$$

$$f(y) = \frac{d[F(y)]}{dy}$$

$$f(y) \ge 0$$

$$\int_{-\infty}^{\infty} f(y)\,dy = 1$$

F(y) is continuous

$P(y = a) = 0$ for all continuous r.v. (a is a constant)

where f ( y ) is the density function of y

F(y) is the (probability) distribution function of y


26

Expectations

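$$E(y) = \sum_{\text{all } y} y\,p(y) \quad \text{(discrete)}$$

$$E(y) = \int_{-\infty}^{\infty} y\,f(y)\,dy \quad \text{(continuous)}$$

$$E[g(y)] = \int_{-\infty}^{\infty} g(y)\,f(y)\,dy$$

Variance: $\sigma^2 = E[(y - \mu)^2] = E(y^2) - \mu^2$

Standard deviation: $\sigma = \sqrt{\sigma^2}$

Properties of Expectations

$$E(c) = c$$

$$E(cy) = c\,E(y)$$

$$E[g_1(y) + g_2(y) + \ldots + g_k(y)] = E[g_1(y)] + E[g_2(y)] + \ldots + E[g_k(y)]$$

$$\sigma_c^2 = 0$$

$$\sigma_{cy}^2 = c^2 \sigma_y^2$$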


27

The Uniform Distribution

$$\mu = \frac{a + b}{2}, \qquad \sigma^2 = \frac{(b - a)^2}{12}$$

A frequently used model when no data are available.


28

The Triangular Distribution

A good model to use when no data are available. Just ask an expert to estimate the minimum, maximum, and most likely values.


29

The Normal Distribution

$$z = \frac{y - \mu}{\sigma}$$

the standard normal variable

Tables provide cumulative values for the Standard Normal Distribution $N(\mu = 0, \sigma = 1)$
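In place of a printed table, cumulative standard normal values can be obtained from Python's standard library. The sketch below standardizes a value and looks up its cumulative probability with `statistics.NormalDist`. The population values (mean 100, standard deviation 15) are hypothetical.

```python
from statistics import NormalDist

def z_score(y, mu, sigma):
    """Standardize y: distance from the mean in standard deviations."""
    return (y - mu) / sigma

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Hypothetical example: y ~ N(mu = 100, sigma = 15); find P(y <= 120).
z = z_score(120.0, 100.0, 15.0)
prob = standard_normal.cdf(z)
print(round(z, 4), round(prob, 4))
```

Standardizing first and then using the N(0, 1) distribution gives the same probability as evaluating the cumulative distribution of N(100, 15) directly, which is why one table is enough.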


30

The Lognormal Distribution

Consider this model when 80 percent of the data values lie in the first 20% of the variable’s range.


31

The Gamma Distribution

Properties: $\mu = \alpha\beta$, $\sigma^2 = \alpha\beta^2$ (shape $\alpha$, scale $\beta$)


32

The Erlang Distribution

A special case of the Gamma Distribution when the shape parameter $\alpha = k$ is an integer. It arises in a Poisson process where we are interested in the time to observe k events.


33

The Exponential Distribution

A special case of the Gamma Distribution when the shape parameter $\alpha = 1$


34

The Weibull Distribution

A good model for failure time distributions of manufactured items. It has a closed expression for F ( y ).


35

The Beta Distribution

A good model for proportions. You can fit almost any data. However, the data set MUST be bounded!


36

Bivariate Data (Pairs of Random Variables)

• Covariance: measures strength of linear relationship

$$Cov(X, Y) = E\{[X - E(X)][Y - E(Y)]\} = E(XY) - E(X)\,E(Y)$$

• Correlation: a standardized version of the covariance

$$\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

• Autocorrelation: for a single time series, the relationship between an observation and those immediately preceding it. Does the current value ($X_t$) relate to itself lagged one period ($X_{t-1}$)?
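The three quantities on this slide can be estimated from data as sketched below. The sample covariance uses the divisor n - 1, and the lag-1 autocorrelation is computed here simply as the correlation between the series and itself shifted by one period. That is one common estimator among several, so treat it as an assumption. The data are hypothetical.

```python
import math

def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance standardized by the two sample standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def lag1_autocorrelation(series):
    """Correlation between X_t and X_(t-1), using the overlapping pairs."""
    return correlation(series[1:], series[:-1])

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(covariance(x, y), 4), round(correlation(x, y), 4))
```

Unlike the covariance, the correlation has no units and always lies between -1 and 1, which makes it comparable across data sets.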


37

Sampling Distributions

The population has PARAMETERS $\mu$, $\sigma^2$

A sample yields STATISTICS $\bar{X}$, $S^2$

A statistic is calculated based on the values observed in a sample. Those values are random variables. Therefore, a statistic is a RANDOM VARIABLE.

The sampling distribution of a statistic is its probability distribution.

The STANDARD ERROR of a statistic is the standard deviation of its sampling distribution.

See slides 8 and 9 for formulas to calculate sample means and variances (raw data and grouped data, respectively).


38

The Sampling Distribution of the Mean (Central Limit Theorem)

The CENTRAL LIMIT THEOREM: If random samples of size n are taken from a population having ANY distribution with mean $\mu$ and (finite) standard deviation $\sigma$, then, when n is large enough, the sampling distribution of the mean can be approximated by a normal density with mean $\mu_{\bar{Y}} = \mu$ and standard deviation $\sigma_{\bar{Y}} = \sigma / \sqrt{n}$.


39

The Sampling Distribution of Sums

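Let $L = a_1 y_1 + a_2 y_2 + \ldots + a_k y_k$

Assume $E(y_i) = \mu_i$, $Var(y_i) = \sigma_i^2$, $Cov(y_i, y_j) = \sigma_{ij}$

Then L possesses a normal density (when the $y_i$ are normally distributed) with mean and variance:

$$E(L) = a_1 \mu_1 + a_2 \mu_2 + \ldots + a_k \mu_k$$

$$Var(L) = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \ldots + a_k^2 \sigma_k^2 + 2 a_1 a_2 \sigma_{12} + 2 a_1 a_3 \sigma_{13} + \ldots + 2 a_{k-1} a_k \sigma_{k-1,k}$$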


40

Distributions Related to Variances

For a sample of size n with standard deviation S, the statistic

$$\chi^2 = \frac{(n - 1) S^2}{\sigma^2}$$

follows a Chi-square distr. with $\nu = n - 1$ df.

For two independent samples, the statistic

$$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2}$$

follows an F distribution with parameters $\nu_1$ df in the numerator and $\nu_2$ df in the denominator.

The sum of two (independent) chi-squares follows a chi-square distribution with $\nu = \nu_1 + \nu_2$


41

The t Distribution

Let z be a standard normal variable and $\chi^2$ be a chi-square random variable with $\nu$ degrees of freedom. If z and $\chi^2$ are independent, then

$$t = \frac{z}{\sqrt{\chi^2 / \nu}}$$

is said to possess a Student's t distribution ("t-distribution") with $\nu$ df.

COROLLARY: For a random sample taken from a normal population,

$$t = \frac{\bar{y} - \mu}{S / \sqrt{n}}$$

follows a t distribution with $\nu = n - 1$ df.


42

Estimation

• Point and Interval Estimators

• Properties of Point Estimators

– Unbiased: E (estimator) = estimated parameter, e.g., $E(\bar{Y}) = \mu$

Note: $S^2$ is unbiased if it is computed with the divisor n - 1

– MVUE: Minimum Variance Unbiased Estimators

• Most frequently used method to estimate parameters: MLE - Maximum Likelihood Estimators.


43

Interval Estimators -- Large sample CI for mean

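From the Central Limit Theorem:

$$\text{Prob}\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha$$

After some algebraic manipulation we get:

$$\text{Prob}\left(\bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$$

The $(1 - \alpha) \cdot 100\%$ Confidence Interval for $\mu$:

$$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$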
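A minimal sketch of the large-sample interval for the mean follows. The critical value comes from `statistics.NormalDist().inv_cdf` rather than a table, and the sample summary (n = 64, mean 50.0, standard deviation 8.0) is hypothetical. For large samples the sample standard deviation S is typically used in place of the unknown population value.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Large-sample CI: xbar +/- z_(alpha/2) * sigma / sqrt(n)."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point of N(0, 1)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample summary: n = 64, mean 50.0, standard deviation 8.0
low, high = mean_confidence_interval(50.0, 8.0, 64)
print(round(low, 2), round(high, 2))
```

Quadrupling n halves the width of the interval, since the half-width shrinks with the square root of the sample size.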


44

Interval Estimators -- Small sample CI for mean

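For small samples ( n < 30 ):

$$\text{Prob}\left(-t_{\alpha/2} \le \frac{\bar{X} - \mu}{S / \sqrt{n}} \le t_{\alpha/2}\right) = 1 - \alpha$$

After some algebraic manipulation we get:

$$\text{Prob}\left(\bar{X} - t_{\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2} \frac{S}{\sqrt{n}}\right) = 1 - \alpha$$

The $(1 - \alpha) \cdot 100\%$ Confidence Interval for $\mu$ (small samples):

$$\bar{X} \pm t_{\alpha/2} \frac{S}{\sqrt{n}}$$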


45

Sample Size

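Based on CI for the mean:

$$n = \frac{z_{\alpha/2}^2\, \sigma^2}{E^2} \approx \frac{z_{\alpha/2}^2\, S^2}{E^2}$$

where E is the desired half-width of the confidence interval

Recommendation:

Sample approximately 30

Estimate $\sigma^2$ using $S^2$

Estimate n

Take more observations as needed.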
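The recommendation on this slide (pilot sample, estimate the variance, then estimate n) can be sketched as below. The half-width argument is the largest acceptable distance between the sample mean and the interval limit. Its name, the function name, and the pilot values are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(s, half_width, confidence=0.95):
    """Smallest n with z_(alpha/2) * s / sqrt(n) <= half_width."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * s / half_width) ** 2)

# Hypothetical pilot sample: S = 10.0, and we want the mean within +/- 2.0
n_required = sample_size_for_mean(s=10.0, half_width=2.0)
print(n_required)
```

The result is rounded up because a fractional observation is not possible and rounding down would give an interval slightly wider than requested.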


46

CI for proportions (large samples)

The distribution of a proportion is fairly normal with mean = p and variance

$$\sigma_{\hat{p}}^2 = \frac{p (1 - p)}{n}$$

Then, the C. I. for the population proportion is:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$$

where $\hat{p} = y / n$ is the observed proportion of successes

Assumption: The interval does not contain 0 or 1.
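As an illustration, the sketch below computes the large-sample interval for a proportion. The counts (17 successes out of 37) are the EE category from the one-way table example on slide 71, so the result can be compared with the interval quoted there. The function name is an assumption.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Large-sample CI: p_hat +/- z_(alpha/2) * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = successes / n
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = proportion_confidence_interval(17, 37)  # EE counts from slide 71
print(round(low, 2), round(high, 2))
```

Both limits fall strictly between 0 and 1 here, so the stated assumption for the normal approximation holds in this example.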


47

Sample Size (proportions)

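Based on CI for a proportion:

$$n = \frac{z_{\alpha/2}^2\, \hat{p} (1 - \hat{p})}{E^2}$$

where E is the desired half-width of the confidence interval

Recommendation:

Sample approximately 30

Estimate $\hat{p}$

Estimate n

Take more observations as needed.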


48

CI for the variance

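The statistic:

$$\chi^2 = \frac{(n - 1) S^2}{\sigma^2} \sim \text{a Chi-Square distr. with } \nu = n - 1$$

After some algebraic manipulation:

$$\text{Prob}\left(\frac{(n - 1) S^2}{\chi^2_{\alpha/2,\,\nu}} \le \sigma^2 \le \frac{(n - 1) S^2}{\chi^2_{(1 - \alpha/2),\,\nu}}\right) = 1 - \alpha$$

Assumption: Population is approximately normal.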


49

CI for the Difference of Two Means -- large samples --

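The difference of two sample means follows (approximately) a normal density with:

$$E(\bar{Y}_1 - \bar{Y}_2) = \mu_1 - \mu_2 \quad \text{and} \quad Var(\bar{Y}_1 - \bar{Y}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$

$$\text{C.I. for } (\mu_1 - \mu_2) = (\bar{Y}_1 - \bar{Y}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \approx (\bar{Y}_1 - \bar{Y}_2) \pm z_{\alpha/2} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$

Assumptions: Independent samples with more than 30 observations each.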


50

CI for (p1 - p2) --- (large samples)

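For large samples ($n_1 \ge 30$ and $n_2 \ge 30$):

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\, \sigma_{(\hat{p}_1 - \hat{p}_2)} \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$$

where $\hat{q}_i = 1 - \hat{p}_i$

Approximation is good as long as neither interval $\hat{p}_i \pm 2\sigma_{\hat{p}_i}$ includes 0 or 1.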


51

CI for the Difference of Two Means -- small samples, same variance --

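$$\text{C.I. for } (\mu_1 - \mu_2) = (\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha/2,\, n_1 + n_2 - 2} \sqrt{S_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $S_p^2 = \dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$ ("pooled variance")

Assumptions:

1. Independent samples taken from normal populations.

2. Variances are unknown but equal ($\sigma_1^2 = \sigma_2^2$)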


52

CI for the Difference of Two Means -small samples, different variances-

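$$\text{C.I. for } (\mu_1 - \mu_2) = (\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha/2,\, \nu} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$

where

$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{(S_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(S_2^2 / n_2)^2}{n_2 - 1}} \quad \text{(round down)}$$

Assumptions: Independent samples taken from normal populations.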
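The degrees-of-freedom formula for unequal variances is tedious by hand, so a small sketch may help. It computes the approximate degrees of freedom (rounded down, as the slide instructs) and the standard error of the difference. The t critical value still has to be read from a table, since the Python standard library has no t quantile function. The sample variances and sizes are hypothetical.

```python
import math

def welch_degrees_of_freedom(s1_sq, n1, s2_sq, n2):
    """Approximate df for two samples with unequal variances, rounded down."""
    a = s1_sq / n1
    b = s2_sq / n2
    nu = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return math.floor(nu)

def standard_error_difference(s1_sq, n1, s2_sq, n2):
    """Standard error of (Y1_bar - Y2_bar) without pooling the variances."""
    return math.sqrt(s1_sq / n1 + s2_sq / n2)

# Hypothetical samples: S1^2 = 4.0 with n1 = 10, S2^2 = 9.0 with n2 = 15
nu = welch_degrees_of_freedom(4.0, 10, 9.0, 15)
se = standard_error_difference(4.0, 10, 9.0, 15)
print(nu, round(se, 4))
```

The resulting df never exceeds the pooled value n1 + n2 - 2, so rounding down keeps the interval on the conservative side.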


53

CI for the Difference of Two Means -- matched pairs --

We have PAIRS of observations related through some common factor ($Y_{1i}$, $Y_{2i}$):

Let $d_i = Y_{1i} - Y_{2i}$, the observed difference for pair i

$$\text{C.I. for } \mu_d: \quad \bar{d} \pm t_{\alpha/2,\, n - 1} \frac{S_d}{\sqrt{n}}$$

where $\bar{d}$ and $S_d$ are the mean and the standard deviation of the n sample differences.

Assumptions: Random observations; the population of paired differences is normally distributed.


54

CI for two variances

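Recall:

$$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2} = \frac{\dfrac{(n_1 - 1) S_1^2}{\sigma_1^2} \Big/ (n_1 - 1)}{\dfrac{(n_2 - 1) S_2^2}{\sigma_2^2} \Big/ (n_2 - 1)} = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1 - 1,\, n_2 - 1}$$

After some algebraic manipulation:

$$\text{Prob}\left(\frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{n_1 - 1,\, n_2 - 1,\, \alpha/2}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{n_1 - 1,\, n_2 - 1,\, (1 - \alpha/2)}}\right) = 1 - \alpha$$

where the last subscript of F is the upper-tail area.

Assumption: Independent samples from normal populations.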

Assumption: Independent samples from normal populations.


55

Prediction Intervals

Consider the prediction of the value for the NEXT observation (not the mean value but its actual value), e.g., we want a "confidence interval" for $y_{n+1}$.

Consider the difference between this observation and the sample mean:

$$E(y_{n+1} - \bar{y}) = E(y_{n+1}) - E(\bar{y}) = \mu - \mu = 0$$

$$\sigma^2_{(y_{n+1} - \bar{y})} = \sigma^2_{y_{n+1}} + \sigma^2_{\bar{y}} = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2 \left(1 + \frac{1}{n}\right)$$

If the distribution of y is approximately normal, this difference will also be normal.

This yields the following "prediction interval" for the next observation, $y_{n+1}$:

$$\Pr\left(\bar{y} - t_{\alpha/2,\, n - 1}\, S \sqrt{1 + \frac{1}{n}} \le y_{n+1} \le \bar{y} + t_{\alpha/2,\, n - 1}\, S \sqrt{1 + \frac{1}{n}}\right) = 1 - \alpha$$


56

Hypothesis Testing

• Elements of a Statistical Test. Focus on decisions made when comparing the observed sample to a claim (hypotheses). How do we decide whether the sample disagrees with the hypothesis?

• Null Hypothesis, H0. A claim about one or more population parameters. What we want to REJECT.

• Alternative Hypothesis, Ha: What we test against. Provides criteria for rejection of H0.

• Test Statistic: computed from sample data.

• Rejection (Critical) Region, indicates values of the test statistic for which we will reject H0.


57

Errors in Decision Making

                True State of Nature
Decision        H0: Dishonest client    Ha: Honest client
Do not lend     Correct decision        Type II error
Lend            Type I error            Correct decision


58

Statistical Errors

Type I error ($\alpha$): Rejecting a true Null Hypothesis (producer's risk)

Type II error ($\beta$): Rejecting a true Alternative Hypothesis, i.e., failing to reject a false Null Hypothesis (consumer's risk)

Power of a statistical test, $(1 - \beta)$, is the probability of rejecting the null hypothesis $H_0$ when, in fact, $H_0$ is false.


59

Statistical Tests

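One-tailed tests:

$H_0: \mu = \mu_0$    $H_a: \mu > \mu_0$ (or $\mu < \mu_0$)

Rejection region: $z > z_\alpha$ (or $z < -z_\alpha$)

Two-tailed tests:

$H_0: \mu = \mu_0$    $H_a: \mu \ne \mu_0$

Rejection region: $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$

where $z = \dfrac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$ and $P(z > z_\alpha) = \alpha$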
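The rejection regions on this slide translate directly into code. The sketch below computes the z statistic and applies the one-tailed or two-tailed rule, taking the critical value from `statistics.NormalDist` instead of a table. The function name, the `alternative` labels, and the sample figures are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Return (z, critical value, reject H0?) for H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    std = NormalDist()
    if alternative == "greater":        # Ha: mu > mu0, reject if z > z_alpha
        crit = std.inv_cdf(1.0 - alpha)
        reject = z > crit
    elif alternative == "less":         # Ha: mu < mu0, reject if z < -z_alpha
        crit = std.inv_cdf(1.0 - alpha)
        reject = z < -crit
    elif alternative == "two-sided":    # Ha: mu != mu0, reject if |z| > z_(alpha/2)
        crit = std.inv_cdf(1.0 - alpha / 2.0)
        reject = abs(z) > crit
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, crit, reject

# Hypothetical data: n = 64, sample mean 52.5, sigma = 8.0, H0: mu = 50
z, crit, reject = z_test_mean(52.5, 50.0, 8.0, 64, alternative="greater")
print(round(z, 3), round(crit, 3), reject)
```

A two-tailed test splits the same significance level between both tails, so it needs a more extreme statistic to reject than a one-tailed test does.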


60

The Critical Value

The sample size for specified $\alpha$ and $\beta$ when testing $H_0: \mu = \mu_0$ versus $H_a: \mu = \mu_a$ is given by

$$n = \frac{(z_\alpha + z_\beta)^2\, \sigma^2}{(\mu_0 - \mu_a)^2}$$

Assumption: $\sigma$ is the same under both hypotheses.


61

The observed significance level for a test

It is standard in industry to use $\alpha = 0.05$.

Some researchers prefer to report the observed "p-value". This is the probability (under $H_0$) of observing a value of the test statistic at least as extreme as the one actually observed. This allows the reader to make his (her) own decision about accepting or rejecting $H_0$.

Most computer packages report the significance as (for example) Prob > |T|


62

Testing proportions (large samples)

$H_0: p = p_0$

$H_a: p > p_0$

test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$

where $\hat{p} = y / n$ is the observed proportion of successes

Rejection region (example): $z > z_\alpha$ (for $H_a: p > p_0$)

Assumption: The interval $\hat{p} \pm 2 \sqrt{\hat{p} (1 - \hat{p}) / n}$ does not contain 0 or 1.


63

Testing a Normal Mean

Select $\alpha$. Set your test as one-tailed or two-tailed.

Calculate test statistic: $z = \dfrac{\bar{y} - \mu_0}{\sigma / \sqrt{n}} \approx \dfrac{\bar{y} - \mu_0}{S / \sqrt{n}}$

Compare to the critical value (from book's table).

If sample is small ( n < 30 ):

Calculate test statistic: $t = \dfrac{\bar{y} - \mu_0}{S / \sqrt{n}}$

(assumes an approximately normal population)


64

Testing a variance

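$H_0: \sigma^2 = \sigma_0^2$

test statistic: $\chi^2 = \dfrac{(n - 1) S^2}{\sigma_0^2}$

Rejection region:

$\chi^2 > \chi^2_{\alpha}$ for $H_a: \sigma^2 > \sigma_0^2$

$\chi^2 < \chi^2_{1 - \alpha}$ for $H_a: \sigma^2 < \sigma_0^2$

$\chi^2 < \chi^2_{1 - \alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$ for $H_a: \sigma^2 \ne \sigma_0^2$

Assumption: Population is approximately normal.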


65

Testing Differences of Two Means -- large samples --

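$H_0: (\mu_1 - \mu_2) = D_0$

test statistic: $z = \dfrac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$

Rejection region:

$z > z_\alpha$ if $H_a: (\mu_1 - \mu_2) > D_0$

$z < -z_\alpha$ if $H_a: (\mu_1 - \mu_2) < D_0$

$z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$ if $H_a: (\mu_1 - \mu_2) \ne D_0$

Assumptions: Independent samples with more than 30 observations each.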


66

Testing Differences of Two Means -- small samples, same variance --

$H_0: (\mu_1 - \mu_2) = D_0$

test statistic: $t = \dfrac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$

Rejection region (example): $t > t_{\alpha,\, n_1 + n_2 - 2}$ ($H_a: (\mu_1 - \mu_2) > D_0$)

where $S_p^2 = \dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$ ("pooled variance")

Assumptions:

1. Indep. samples from normal populations.

2. Variances are unknown but equal ($\sigma_1^2 = \sigma_2^2$)


67

Testing Differences of Two Means -small samples, different variances-

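$H_0: (\mu_1 - \mu_2) = D_0$

test statistic: $t = \dfrac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$

Rejection region (example): $t > t_{\alpha,\, \nu}$ ($H_a: (\mu_1 - \mu_2) > D_0$)

where

$$\nu = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{(S_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(S_2^2 / n_2)^2}{n_2 - 1}} \quad \text{(round down)}$$

Assumptions: Independent samples taken from approximately normal populations.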


68

Testing Difference of Two Means -- matched pairs --

We have PAIRS of observations related through some common factor ($Y_{1i}$, $Y_{2i}$):

Let $d_i = Y_{1i} - Y_{2i}$, the observed difference for pair i

$H_0: \mu_{diff} = (\mu_1 - \mu_2) = D_0$    test statistic: $t = \dfrac{\bar{d} - D_0}{S_d / \sqrt{n}}$

Rejection region: $t > t_{\alpha,\, n - 1}$ for $H_a: \mu_{diff} = (\mu_1 - \mu_2) > D_0$

where $\bar{d}$ and $S_d$ are the mean and the standard deviation of the n sample differences.

Assumptions: Random observations; the population of paired differences is normally distributed.


69

Testing a ratio of two variances

$H_0: \sigma_1^2 / \sigma_2^2 = 1$ (i.e., $\sigma_1^2 = \sigma_2^2$)

test statistic: F = (larger sample variance) / (smaller sample variance)

Rejection region:

$F > F_{\alpha/2}$ for $H_a: \sigma_1^2 / \sigma_2^2 \ne 1$

$F > F_{\alpha}$ for $H_a: \sigma_1^2 / \sigma_2^2 > 1$ (or $< 1$)

Assumption: Independent samples from normal populations.

Note: Make sure the df in the numerator are those of the sample with larger variance!


70

Testing (p1 - p2) --- (large samples)

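For large samples ($n_1 \ge 30$ and $n_2 \ge 30$):

$H_0: (p_1 - p_2) = D_0$

test statistic: $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{(\hat{p}_1 - \hat{p}_2)}}$

where $\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}$ when $D_0 \ne 0$

and $\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\hat{p} \hat{q} \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$ when $D_0 = 0$, with $\hat{p} = \dfrac{y_1 + y_2}{n_1 + n_2}$

Approximation is good as long as no interval includes 0 or 1.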


71

Categorical Data

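One-way Table: Categories and their frequencies:

Categ.   1     2     ..   k     Total
Freq.    n1    n2    ..   nk    n

Large sample conf. int. for $p_i$:

$$\hat{p}_i \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{n}}$$

Example:

         EE    ME    Others   Total
Freq.    17    11    9        37

Then, for $p_{EE}$:

$$\frac{17}{37} \pm 1.96 \sqrt{\frac{1}{37} \cdot \frac{17}{37} \cdot \frac{20}{37}} = 0.46 \pm 0.16$$

$$0.30 \le p_{EE} \le 0.62$$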


72

One-way Tables (Cont.)

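Large sample $(1 - \alpha)\,100\%$ Conf. Int. for $p_i - p_j$:

$$(\hat{p}_i - \hat{p}_j) \pm z_{\alpha/2} \sqrt{\frac{1}{n} \left[\hat{p}_i (1 - \hat{p}_i) + \hat{p}_j (1 - \hat{p}_j) + 2 \hat{p}_i \hat{p}_j\right]}$$

In the example, for $p_{EE} - p_{ME}$:

$$\frac{17}{37} - \frac{11}{37} \pm 1.96 \sqrt{\frac{1}{37} \left[\frac{17}{37} \cdot \frac{20}{37} + \frac{11}{37} \cdot \frac{26}{37} + 2 \cdot \frac{17}{37} \cdot \frac{11}{37}\right]} = 0.162 \pm 0.275$$

$$-0.113 \le p_{EE} - p_{ME} \le 0.437 \quad \text{NOT significant!}$$

NOTE: $-0.045 \le p_{EE} - p_{Others} \le 0.477$; again, the difference is NOT significant!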
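The example intervals on this slide can be reproduced with a few lines of code. The sketch below uses the slide's frequencies (EE 17, ME 11, Others 9) and z = 1.96. The function name is an assumption. The variance includes the term 2 p_i p_j because both proportions come from the same sample and are therefore negatively correlated.

```python
import math

def difference_ci_same_table(count_i, count_j, n, z=1.96):
    """CI for p_i - p_j when both proportions come from one one-way table."""
    p_i, p_j = count_i / n, count_j / n
    variance = (p_i * (1 - p_i) + p_j * (1 - p_j) + 2 * p_i * p_j) / n
    half_width = z * math.sqrt(variance)
    diff = p_i - p_j
    return diff - half_width, diff + half_width

counts = {"EE": 17, "ME": 11, "Others": 9}   # frequencies from the slide's example
n = sum(counts.values())
ee_me = difference_ci_same_table(counts["EE"], counts["ME"], n)
ee_others = difference_ci_same_table(counts["EE"], counts["Others"], n)
print([round(v, 3) for v in ee_me], [round(v, 3) for v in ee_others])
```

The computed limits agree with the slide to within rounding of the intermediate values. Both intervals contain 0, which is why neither difference is declared significant.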


73

Categorical Data Analysis

General r x c Contingency Table

          1         2         ..   c         Totals
1         n(1,1)    n(1,2)    ..   n(1,c)    r(1)
2         n(2,1)    n(2,2)    ..   n(2,c)    r(2)
..        ..        ..        ..   ..        ..
r         n(r,1)    n(r,2)    ..   n(r,c)    r(r)
Totals    c(1)      c(2)      ..   c(c)      n


74

Example of a Contingency Table

STA 3032 - Summer 1994

Grade     Q2    Q4    Q6    Total
0-2       13    0     2     15
2.1-4     6     1     1     8
4.1-6     8     5     11    24
6.1-8     4     7     9     20
8.1-10    2     16    6     24
Total     33    29    29    91


75

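Testing for Independence

$H_0$: Variables are independent    $H_a$: They are not

Test statistic:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} = n \left[\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{n_{ij}^2}{r_i c_j} - 1\right]$$

where $E(n_{ij}) = \dfrac{r_i c_j}{n}$

Rejection region: $\chi^2 > \chi^2_{0.05,\,(r - 1)(c - 1)}$

Note: regroup rows (columns) as needed for $E(n_{ij}) \ge 5$

In the example:

$$\chi^2 = 91 \left[\frac{19^2}{23 \cdot 33} + \frac{1^2}{23 \cdot 29} + \ldots + \frac{6^2}{24 \cdot 29} - 1\right] = 41.33 > \chi^2_{0.05,\,6} = 12.5916$$

(Note regrouping! Compare to $\chi^2$ from Table)

Conclusion: Variables are NOT independent.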
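The independence test for the grade example can be reproduced as sketched below. The table is the slide 74 data after merging the first two grade rows (0-2 and 2.1-4) so that every expected count is at least 5. The function name is an assumption, and the critical value 12.5916 is the table value quoted on the slide, not something the code computes.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: (row total * column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

regrouped = [
    [19, 1, 3],    # grades 0-4 (rows 0-2 and 2.1-4 merged)
    [8, 5, 11],    # grades 4.1-6
    [4, 7, 9],     # grades 6.1-8
    [2, 16, 6],    # grades 8.1-10
]
chi2, df, expected = chi_square_independence(regrouped)
CRITICAL_0_05_DF6 = 12.5916   # table value quoted on the slide
reject_independence = chi2 > CRITICAL_0_05_DF6
print(round(chi2, 2), df, reject_independence)
```

Without the regrouping, the first two rows would have expected counts below 5 in the Q4 and Q6 columns, which is the situation the slide's note warns about.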


76

Distributions: Model Fitting Steps

1. Collect data. Make sure you have a random sample. You will need at least 30 valid cases

2. Plot data. Look for familiar patterns

3. Hypothesize several models for distribution

4. Using part of the data, estimate model parameters

5. Using the rest of the data, analyze the model’s accuracy

6. Select the “best” model and implement it

7. Keep track of model accuracy over time. If warranted, go back to 6 (or to 3, if data (population?) behavior keeps changing)


77

Chi-Square Test of Goodness of Fit

$H_0: p_1 = p_{10};\; p_2 = p_{20};\; \ldots;\; p_k = p_{k0}$ with $\sum_{i} p_{i0} = 1$

$H_a$: At least one $p_i \ne p_{i0}$

Let n = sample size and $n_i$ the observed frequency in cell i (so that $\hat{p}_i = n_i / n$)

Make sure that $e_i = n p_{i0} \ge 5$ (if not, regroup cells as needed).

Test Statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}}$$

Rejection Region: $\chi^2 > \chi^2_{\alpha,\,\nu}$

where:

$\nu = k - r - 1$

k = number of cells after regrouping

r = number of parameters estimated from data to calculate $p_{i0}$


78

Kolmogorov-Smirnov Test of Goodness of Fit

Compares the empirical distribution function $F_n(y)$ with a hypothesized theoretical distribution function $F(y)$.

Empirical: $F_n(y)$ = fraction of the sample less than or equal to y = $i / n$ for the $i^{th}$ ranked observation $y_{(i)}$

Let $D^+ = \max_i \left[\dfrac{i}{n} - F(y_{(i)})\right]$ and $D^- = \max_i \left[F(y_{(i)}) - \dfrac{i - 1}{n}\right]$

Then $D = \max_y |F_n(y) - F(y)| = \max(D^+, D^-)$

Critical values given in tables
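A minimal sketch of the Kolmogorov-Smirnov statistic follows. It sorts the sample, evaluates the hypothesized distribution function at each ranked observation, and takes the larger of the two one-sided distances. The Uniform(0, 1) model and the three data values are hypothetical, and the critical value still has to come from a table.

```python
def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov D = max(D+, D-) for a sample and a hypothesized CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    # D+: empirical step just after each observation (i/n) minus the model CDF
    d_plus = max((i + 1) / n - cdf(y) for i, y in enumerate(ordered))
    # D-: model CDF minus the empirical step just before each observation ((i-1)/n)
    d_minus = max(cdf(y) - i / n for i, y in enumerate(ordered))
    return max(d_plus, d_minus)

def uniform_cdf(y):
    """CDF of the Uniform(0, 1) distribution (the hypothesized model here)."""
    return min(max(y, 0.0), 1.0)

d = ks_statistic([0.7, 0.1, 0.4], uniform_cdf)   # hypothetical sample
print(round(d, 4))
```

Both one-sided distances are needed because the empirical distribution function jumps at each observation, so the largest gap can occur just before or just after a jump.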


79

A Review of Probability and Statistics

• Descriptive statistics

• Probability

• Random variables

• Sampling distributions

• Estimation and confidence intervals

• Test of Hypothesis

–For mean, variances, and proportions

–Goodness of fit
