Post on 19-Dec-2015
Dr. Héctor Allende Review of Probability and Statistics
1
A Review of Probability and Statistics
• Descriptive statistics
• Probability
• Random variables
• Sampling distributions
• Estimation and confidence intervals
• Tests of hypotheses
– For means, variances, and proportions
– Goodness of fit
2
Key Concepts
• Population -- "parameters"
–Finite
–Infinite
• Sample -- "statistics"
• Random samples - Your MOST important decision!
3
Data
• Deterministic vs. Probabilistic (Stochastic)
• Discrete or Continuous:
– Whether a variable is continuous (measured) or discrete (counted) is a property of the data, not of the measuring device: weight is a continuous variable, even if your scale can only measure values to the pound.
• Data description:
– Category frequency
– Category relative frequency
4
Data Types
• Qualitative (Categorical)
–Nominal -- IE = 1 ; EE = 2 ; CE = 3
–Ordinal -- poor = 1 ; fair = 2 ; good = 3 ; excellent = 4
• Quantitative (Numerical)
–Interval -- temperature, viscosity
–Ratio -- weight, height
• The type of statistics you can calculate depends on the data type. Mean and variance make no sense if the data are categorical (proportions do); a median is meaningful only when the categories are ordered.
5
Data Presentation for Qualitative Data
• Rules:
– Each observation MUST fall in one and only one category.
– All observations must be accounted for.
• Table -- Provides greater detail
• Bar graphs -- Consider Pareto presentation!
• Pie charts (do not need to be round)
6
Data Presentation for Quantitative Data
• Consider a Stem-and-Leaf Display
• Use 5 to 20 classes (intervals, groups).
–Cell width, boundaries, limits, and midpoint
• Histograms
–Discrete
–Continuous (frequency polygon - plot at class mark)
• Cumulative frequency distribution (Ogive - plot at upper boundary)
7
Statistics
• Measures of Central Tendency
– Arithmetic Mean
– Median
– Mode
– Weighted mean
• Measures of Variation
– Range
– Variance
– Standard Deviation
• Coefficient of Variation
• The Empirical Rule
8
Arithmetic Mean and Variance -- Raw Data
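• Mean
$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$
• Variance
$S^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2}{n (n - 1)}$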
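The raw-data mean and variance can be sketched in a few lines of Python. This is only an illustration: the function names and the data values are made up here, and the two variance functions correspond to the definitional form (squared deviations over n - 1) and the computational shortcut form.

```python
def mean(ys):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(ys) / len(ys)


def variance_definition(ys):
    """Sample variance as the sum of squared deviations over (n - 1)."""
    n = len(ys)
    y_bar = mean(ys)
    return sum((y - y_bar) ** 2 for y in ys) / (n - 1)


def variance_shortcut(ys):
    """Sample variance as [n * sum(y^2) - (sum y)^2] / [n (n - 1)]."""
    n = len(ys)
    return (n * sum(y * y for y in ys) - sum(ys) ** 2) / (n * (n - 1))


data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # hypothetical observations
print(mean(data), variance_definition(data), variance_shortcut(data))
```

Both variance functions should agree up to floating-point rounding; the shortcut form avoids a second pass over the data.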
9
Arithmetic Mean and Variance -- Grouped Data
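• Mean
$\bar{y} = \frac{\sum_{i} f_i y_i}{n}$
• Variance
$S^2 = \frac{\sum_{i} f_i (y_i - \bar{y})^2}{n - 1} = \frac{n \sum_{i} f_i y_i^2 - \left( \sum_{i} f_i y_i \right)^2}{n (n - 1)}$
where $n = \sum_{i} f_i$ and $y_i$ = class midpoint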
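For grouped data the same calculation works on class midpoints weighted by class frequencies. The sketch below assumes three hypothetical classes; the function name and the numbers are illustrative only.

```python
def grouped_mean_variance(midpoints, freqs):
    """Mean and sample variance from class midpoints and class frequencies."""
    n = sum(freqs)                                    # n = sum of f_i
    sum_fy = sum(f * y for f, y in zip(freqs, midpoints))
    sum_fy2 = sum(f * y * y for f, y in zip(freqs, midpoints))
    mean = sum_fy / n
    variance = (n * sum_fy2 - sum_fy ** 2) / (n * (n - 1))
    return mean, variance


midpoints = [5.0, 15.0, 25.0]   # hypothetical class midpoints
freqs = [2, 5, 3]               # hypothetical class frequencies
g_mean, g_var = grouped_mean_variance(midpoints, freqs)
print(g_mean, g_var)
```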
10
Percentiles and Box-Plots
• 100pth percentile: value such that 100p% of the area under the relative frequency distribution lies below it.
– Q1: lower quartile (25th percentile)
– Q3: upper quartile (75th percentile)
• Box-Plots: the box is limited by the lower and upper quartiles
– Whiskers mark the lowest and highest values within 1.5*IQR of Q1 or Q3 (IQR = Q3 - Q1)
– Outliers: beyond 1.5*IQR from Q1 or Q3 (mark with *)
• z-scores: deviation from the mean in units of standard deviation. Outlier: absolute value of z-score > 3
11
Probability: Basic Concepts
• Experiment: A process of OBSERVATION
• Simple event - An OUTCOME of an experiment that cannot be decomposed
– “Mutually exclusive”
– “Equally likely”
• Sample Space - The set of all possible outcomes
• Event “A” - The set of all possible simple events that result in the outcome “A”
12
Probability
• A measure of uncertainty of an estimate
– The reliability of an inference
• Theoretical approach - “A Priori”
– Pr (Ai) = n/N
  • n = number of possible ways “Ai” can be observed
  • N = total number of possible outcomes
• Historical (empirical) approach - “A Posteriori”
– Pr (Ai) = n/N
  • n = number of times “Ai” was observed
  • N = total number of observations
• Subjective approach
– An “Expert Opinion”
13
Probability Rules
• Multiplication Rule:
– Number of ways to draw one element from set 1 which contains n1 elements, then an element from set 2, ...., and finally an element from set k (ORDER IS IMPORTANT!): n1* n2* ......* nk
• For the simple events Ai:
$0 \le \Pr(A_i) \le 1$
$\sum_{i} \Pr(A_i) = 1$
14
Permutations and Combinations
• Permutations:
– Number of ways to draw r out of n elements WHEN ORDER IS IMPORTANT:
$P^n_r = \frac{n!}{(n - r)!}$
• Combinations:
– Number of ways to select r out of n items when order is NOT important:
$C^n_r = \frac{n!}{r!\,(n - r)!}$
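Both counts are available in the Python standard library (`math.perm` and `math.comb`, Python 3.8 or later). A minimal check with hypothetical values n = 5 and r = 3:

```python
import math

n, r = 5, 3  # hypothetical values

permutations = math.perm(n, r)   # n! / (n - r)!, order matters
combinations = math.comb(n, r)   # n! / (r! (n - r)!), order ignored

# Each unordered selection of r items can be arranged in r! ordered ways.
print(permutations, combinations, combinations * math.factorial(r))
```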
15
Compound Events
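Union
$A \cup B = \{ x \mid x \in A \text{ or } x \in B \text{ or both} \}$
Intersection
$A \cap B = \{ x \mid x \in A \text{ and } x \in B \}$
Complement
$A' = \{ x \mid x \notin A \}$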
16
Conditional Probability
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$
Multiplicative Rule:
$P(A \cap B) = P(A \mid B)\, P(B)$, provided $P(B) > 0$
17
Other Probability Rules
• Addition Rule:
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
• Mutually Exclusive Events:
$A \cap B = \{\ \}$, so $P(A \cap B) = 0$
• Independence:
– A and B are said to be statistically INDEPENDENT if and only if:
$P(A \cap B) = P(A)\, P(B)$
18
Bayes’ Rule
$P(A_i \mid E) = \frac{P(A_i)\, P(E \mid A_i)}{\sum_{j} P(A_j)\, P(E \mid A_j)}$
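As an illustration of Bayes' rule, the sketch below updates prior probabilities for three mutually exclusive causes after observing an event E. The scenario (three machines and their defect rates) and all numbers are hypothetical.

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(A_i | E) for each i, given P(A_i) and P(E | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_i) P(E | A_i)
    total = sum(joint)                                    # denominator: P(E)
    return [j / total for j in joint]


priors = [0.5, 0.3, 0.2]          # hypothetical P(A_i): share of output per machine
likelihoods = [0.01, 0.02, 0.05]  # hypothetical P(E | A_i): defect rate per machine
posteriors = bayes_posteriors(priors, likelihoods)
print(posteriors)
```

The posteriors always sum to 1 because the denominator is the total probability of E.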
19
Random Variables
• Random variable: A function that maps every possible outcome of an experiment into a numerical value.
• Discrete random variable: The function can assume a finite (or countably infinite) number of values
• Continuous random variable: The function can assume any value between two limits.
20
Probability Distribution for a Discrete Random Variable
• Function that assigns a value to the probability p(y) associated to each possible value of the random variable y.
$0 \le p(y) \le 1$
$\sum_{y} p(y) = 1$
21
Poisson Process
• Events occur over time (or in a given area, volume, weight, distance, ...)
• Probability of observing an event in a given unit of time is constant
• Able to define a unit of time small enough so that we can’t observe two or more events simultaneously.
• Tables usually give CUMULATIVE values!
22
The Poisson Distribution
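$p(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$
x is the number of events observed over T
$\lambda$ is the expected number of events over T
e is the base of natural logs (2.71828)
$\mu = \sigma^2 = \lambda$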
23
Poisson Approximation to the Binomial
• In a binomial situation where n is very large (n > 25) and p is very small (p < 0.30, and np < 15), we can approximate b(y, n, p) by a Poisson distribution with parameter $\lambda = np$:
$b(y, n, p) = \binom{n}{y} p^y (1 - p)^{n - y} \approx P(y, np) = \frac{e^{-np} (np)^y}{y!}$
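The quality of the approximation is easy to inspect numerically. The following sketch compares the exact binomial probabilities with the Poisson values for hypothetical n = 100 and p = 0.02 (so np = 2); the function names are chosen here for illustration.

```python
import math


def binomial_pmf(y, n, p):
    """Exact binomial probability of y successes in n trials."""
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)


def poisson_pmf(y, lam):
    """Poisson probability of y events when the expected count is lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


n, p = 100, 0.02  # hypothetical: large n, small p
for y in range(6):
    print(y, round(binomial_pmf(y, n, p), 4), round(poisson_pmf(y, n * p), 4))
```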
24
Probability Distribution for a Continuous Random Variable
• F( y0 ) is a cumulative distribution function that assigns a value to the probability of observing a value less than or equal to y0:
$F(y_0) = P(y \le y_0) = \int_{-\infty}^{y_0} f(y)\, dy$
Property: F ( y ) is continuous over y
25
Probability Calculations
$P(a \le y \le b) = \int_{a}^{b} f(y)\, dy$
$f(y) = \frac{d[F(y)]}{dy}$
$f(y) \ge 0$ for all y
$\int_{-\infty}^{\infty} f(y)\, dy = 1$
$F(y)$ is continuous
$P(y = a) = 0$ for all continuous r.v. (a is a constant)
where f ( y ) is the density function of y
F ( y ) is the (probability) distribution function of y
26
Expectations
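$E(y) = \mu = \sum_{\text{all } y} y\, p(y)$ (discrete)
$E(y) = \mu = \int_{-\infty}^{\infty} y\, f(y)\, dy$ (continuous)
$E[g(y)] = \int_{-\infty}^{\infty} g(y)\, f(y)\, dy$
Variance: $\sigma^2 = E[(y - \mu)^2] = E(y^2) - \mu^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Properties of Expectations
$E(c) = c$
$E(cy) = c\, E(y)$
$E[g_1(y) + g_2(y) + \cdots + g_k(y)] = E[g_1(y)] + E[g_2(y)] + \cdots + E[g_k(y)]$
$\sigma^2_{c} = 0$
$\sigma^2_{cy} = c^2 \sigma^2_{y}$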
27
The Uniform Distribution
$\mu = \frac{a + b}{2}$
$\sigma^2 = \frac{(b - a)^2}{12}$
A frequently used model when no data are available.
28
The Triangular Distribution
A good model to use when no data are available. Just ask an expert to estimate the minimum, maximum, and most likely values.
29
The Normal Distribution
$z = \frac{y - \mu}{\sigma}$, the standard normal variable
Tables provide cumulative values for the Standard Normal Distribution $N(\mu = 0, \sigma = 1)$
30
The Lognormal Distribution
Consider this model when 80 percent of the data values lie in the first 20% of the variable’s range.
31
The Gamma Distribution
Properties: $\mu = \alpha\beta$, $\sigma^2 = \alpha\beta^2$ (shape $\alpha$, scale $\beta$)
32
The Erlang Distribution
A special case of the Gamma Distribution when the shape parameter $\alpha = k$ is an integer.
A Poisson process where we are interested in the time to observe k events.
33
The Exponential Distribution
A special case of the Gamma Distribution when the shape parameter $\alpha = 1$
34
The Weibull Distribution
A good model for failure time distributions of manufactured items. It has a closed expression for F ( y ).
35
The Beta Distribution
A good model for proportions. You can fit almost any data. However, the data set MUST be bounded!
36
Bivariate Data (Pairs of Random Variables)
• Covariance: measures strength of linear relationship
$\mathrm{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\} = E(XY) - E(X)\,E(Y)$
• Correlation: a standardized version of the covariance
$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
• Autocorrelation: For a single time series: Relationship between an observation and those immediately preceding it. Does the current value (Xt) relate to itself lagged one period (Xt-1)?
37
Sampling Distributions
The population has PARAMETERS $\mu$, $\sigma^2$
A sample yields STATISTICS $\bar{X}$, $S^2$
A statistic is calculated based on the values observed in a sample. Those values are random variables. Therefore, a statistic is a RANDOM VARIABLE.
The sampling distribution of a statistic is its probability distribution.
The STANDARD ERROR of a statistic is the standard deviation of its sampling distribution.
See slides 8 and 9 for formulas to calculate sample means and variances (raw data and grouped data, respectively).
38
The Sampling Distribution of the Mean (Central Limit Theorem)
The CENTRAL LIMIT THEOREM: If random samples of size n are taken from a population having ANY distribution with mean $\mu$ and standard deviation $\sigma$, then, when n is large enough, the sampling distribution of the mean can be approximated by a normal density with mean $\mu_{\bar{Y}} = \mu$ and standard deviation $\sigma_{\bar{Y}} = \sigma / \sqrt{n}$.
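A small simulation makes the theorem concrete. The sketch below draws repeated samples from an exponential population (mean 1, standard deviation 1, clearly non-normal) and checks that the sample means centre on 1 with spread close to 1/sqrt(n). The sample size, number of replications, and seed are arbitrary choices made here.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 40               # hypothetical sample size
replications = 5000  # hypothetical number of repeated samples

# Exponential population with rate 1: mean 1, standard deviation 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(replications)
]

observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)
print(observed_mean, observed_sd, 1 / math.sqrt(n))
```

A histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed.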
39
The Sampling Distribution of Sums
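Let $L = a_1 y_1 + a_2 y_2 + \cdots + a_k y_k$
Assume $E(y_i) = \mu_i$, $\mathrm{Var}(y_i) = \sigma_i^2$, $\mathrm{Cov}(y_i, y_j) = \sigma_{ij}$
Then L has the mean and variance below and, when the $y_i$ are normally distributed, L possesses a normal density:
$E(L) = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_k \mu_k$
$\mathrm{Var}(L) = a_1^2 \sigma_1^2 + a_2^2 \sigma_2^2 + \cdots + a_k^2 \sigma_k^2 + 2 a_1 a_2 \sigma_{12} + 2 a_1 a_3 \sigma_{13} + \cdots + 2 a_{k-1} a_k \sigma_{k-1,k}$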
40
Distributions Related to Variances
For a sample with standard deviation S, the statistic
$\chi^2 = \frac{(n - 1) S^2}{\sigma^2}$
follows a Chi-square distr. with $\nu = n - 1$.
For two independent samples, the statistic
$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2}$
follows an F distribution with parameters $\nu_1$ in the numerator and $\nu_2$ in the denominator.
The sum of two chi-squares follows a chi-square distribution with $\nu = \nu_1 + \nu_2$.
41
The t Distribution
Let z be a standard normal variable and $\chi^2$ be a chi-square random variable with $\nu$ degrees of freedom. If z and $\chi^2$ are independent, then
$t = \frac{z}{\sqrt{\chi^2 / \nu}}$
is said to possess a Student's distribution ("t-distribution") with $\nu$ df.
COROLLARY: For a random sample taken from a normal population,
$t = \frac{\bar{y} - \mu}{S / \sqrt{n}}$
follows a t distribution with $\nu = n - 1$ df.
42
Estimation
• Point and Interval Estimators
• Properties of Point Estimators
– Unbiased: E (estimator) = estimated parameter, e.g. $E(\bar{Y}) = \mu$
Note: $S^2$ is unbiased if it is computed with the divisor $n - 1$
– MVUE: Minimum Variance Unbiased Estimators
• Most frequently used method to estimate parameters: MLE - Maximum Likelihood Estimators.
43
Interval Estimators -- Large sample CI for mean
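From the Central Limit Theorem:
$\text{Prob}\left( -z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le z_{\alpha/2} \right) = 1 - \alpha$
After some algebraic manipulation we get:
$\text{Prob}\left( \bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha$
The ( 1 - $\alpha$ ) * 100% Confidence Interval for $\mu$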
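A minimal sketch of the large-sample interval using only the standard library (`statistics.NormalDist` supplies the z value). It assumes the sample standard deviation S stands in for the unknown sigma; the function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def large_sample_ci(sample, confidence=0.95):
    """Normal-based CI for the mean, using S as an estimate of sigma."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    half_width = z * stdev(sample) / sqrt(n)
    centre = fmean(sample)
    return centre - half_width, centre + half_width


sample = [50 + (i % 7) for i in range(35)]  # hypothetical data, n = 35
lower, upper = large_sample_ci(sample)
print(lower, upper)
```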
44
Interval Estimators -- Small sample CI for mean
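For small samples ( n < 30 ):
$\text{Prob}\left( -t_{\alpha/2} \le \frac{\bar{X} - \mu}{S / \sqrt{n}} \le t_{\alpha/2} \right) = 1 - \alpha$
After some algebraic manipulation we get:
$\text{Prob}\left( \bar{X} - t_{\alpha/2} \frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2} \frac{S}{\sqrt{n}} \right) = 1 - \alpha$
The ( 1 - $\alpha$ ) * 100% Confidence Interval for $\mu$ (small samples)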
45
Sample Size
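Based on CI for the mean:
$n = \frac{z_{\alpha/2}^2\, \sigma^2}{H^2} \approx \frac{z_{\alpha/2}^2\, S^2}{H^2}$, where $H$ is the desired half-width of the interval
Recommendation:
Sample approximately 30
Estimate $\sigma^2$ using $S^2$
Estimate n
Take more observations as needed.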
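The sample-size calculation can be sketched as follows. The pilot standard deviation and the target half-width are hypothetical, and `half_width` is a name introduced here for the desired margin of error.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(s, half_width, confidence=0.95):
    """Smallest n whose normal-based CI has at most the requested half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return ceil((z * s / half_width) ** 2)              # round up to a whole unit


# Hypothetical pilot sample: S = 12, and we want the mean to within +/- 2.
n_required = sample_size_for_mean(s=12.0, half_width=2.0)
print(n_required)
```

If the required n exceeds the pilot sample size, further observations are taken and S is re-estimated, in line with the recommendation above.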
46
CI for proportions (large samples)
The distribution of a proportion is fairly normal with mean = p and variance
$\sigma^2_{\hat{p}} = \frac{p (1 - p)}{n}$
Then, the C. I. for the population proportion is:
$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}$
where $\hat{p} = y / n$ is the observed proportion of successes
Assumption: The interval does not contain 0 or 1.
47
Sample Size (proportions)
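Based on CI for a proportion:
$n = \frac{z_{\alpha/2}^2\, \hat{p} (1 - \hat{p})}{H^2}$, where $H$ is the desired half-width of the interval
Recommendation:
Sample approximately 30
Estimate p with $\hat{p}$
Estimate n
Take more observations as needed.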
48
CI for the variance
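The statistic:
$\chi^2 = \frac{(n - 1) S^2}{\sigma^2}$ ~ A Chi-Square distr. with $\nu = n - 1$
After some algebraic manipulation:
$\text{Prob}\left( \frac{(n - 1) S^2}{\chi^2_{\alpha/2,\, n-1}} \le \sigma^2 \le \frac{(n - 1) S^2}{\chi^2_{(1 - \alpha/2),\, n-1}} \right) = 1 - \alpha$
Assumption: Population is approximately normal.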
49
CI for the Difference of Two Means -- large samples --
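The difference of two means follows a normal density with:
$E(\bar{Y}_1 - \bar{Y}_2) = \mu_1 - \mu_2$ and $\mathrm{Var}(\bar{Y}_1 - \bar{Y}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
C.I. for $\mu_1 - \mu_2$ = $(\bar{Y}_1 - \bar{Y}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \approx (\bar{Y}_1 - \bar{Y}_2) \pm z_{\alpha/2} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$
Assumptions: Independent samples with more than 30 observations each.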
50
CI for (p1 - p2) --- (large samples)
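For large samples ($n_1 \ge 30$ and $n_2 \ge 30$):
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\, \sigma_{(\hat{p}_1 - \hat{p}_2)} \approx (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$, with $\hat{q}_i = 1 - \hat{p}_i$
Approximation is good as long as neither interval includes 0 or 1.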
51
CI for the Difference of Two Means -- small samples, same variance --
C.I. for $\mu_1 - \mu_2$ = $(\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha/2,\, n_1 + n_2 - 2}\; S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
where $S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$ ("pooled variance")
Assumptions:
1. Independent samples taken from normal populations.
2. Variances are unknown but equal ($\sigma_1^2 = \sigma_2^2$)
52
CI for the Difference of Two Means -small samples, different variances-
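C.I. for $\mu_1 - \mu_2$ = $(\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha/2,\, \nu} \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$
where $\nu = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{(S_1^2 / n_1)^2}{n_1 - 1} + \frac{(S_2^2 / n_2)^2}{n_2 - 1}}$ (round down)
Assumptions: Independent samples taken from normal populations.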
53
CI for the Difference of Two Means -- matched pairs --
We have PAIRS of observations related through some common factor ($Y_{1i}$, $Y_{2i}$):
Let $d_i = Y_{1i} - Y_{2i}$, the observed difference for pair i
C.I. for $\mu_d = \mu_1 - \mu_2$: $\bar{d} \pm t_{\alpha/2,\, n-1} \frac{S_d}{\sqrt{n}}$
where $\bar{d}$ and $S_d$ are the mean and the standard deviation of the n sample differences.
Assumptions: Random observations; the population of paired differences is normally distributed.
54
CI for two variances
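Recall:
$F = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2} = \frac{\frac{(n_1 - 1) S_1^2}{\sigma_1^2} / (n_1 - 1)}{\frac{(n_2 - 1) S_2^2}{\sigma_2^2} / (n_2 - 1)} = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1 - 1,\, n_2 - 1}$
After some algebraic manipulation:
$\text{Prob}\left( \frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{n_1 - 1,\, n_2 - 1,\, \alpha/2}} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2}{S_2^2} \cdot \frac{1}{F_{n_1 - 1,\, n_2 - 1,\, (1 - \alpha/2)}} \right) = 1 - \alpha$
Assumption: Independent samples from normal populations.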
55
Prediction Intervals
Consider the prediction of the value for the NEXT observation (not the mean value but its actual value), e.g., we want a "confidence interval" for $y_{n+1}$.
Consider the difference between this observation and the sample mean, $y_{n+1} - \bar{y}$:
$E(y_{n+1} - \bar{y}) = E(y_{n+1}) - E(\bar{y}) = 0$
$\sigma^2_{(y_{n+1} - \bar{y})} = \sigma^2_{y_{n+1}} + \sigma^2_{\bar{y}} = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2 \left( 1 + \frac{1}{n} \right)$
If the distribution of y is approximately normal, this difference will also be normal.
This yields the following "prediction interval" for the next observation, $y_{n+1}$:
$\Pr\left( \bar{y} - t_{\alpha/2,\, n-1}\, S \sqrt{1 + \frac{1}{n}} \le y_{n+1} \le \bar{y} + t_{\alpha/2,\, n-1}\, S \sqrt{1 + \frac{1}{n}} \right) = 1 - \alpha$
56
Hypothesis Testing
• Elements of a Statistical Test. Focus on decisions made when comparing the observed sample to a claim (hypotheses). How do we decide whether the sample disagrees with the hypothesis?
• Null Hypothesis, H0. A claim about one or more population parameters. What we want to REJECT.
• Alternative Hypothesis, Ha: What we test against. Provides criteria for rejection of H0.
• Test Statistic: computed from sample data.
• Rejection (Critical) Region, indicates values of the test statistic for which we will reject H0.
57
Errors in Decision Making
                True State of Nature
Decision        H0 (Dishonest client)   Ha (Honest client)
Do not lend     Correct decision        Type II error
Lend            Type I error            Correct decision
58
Statistical Errors
Type I error ($\alpha$): Rejecting a true Null Hypothesis (producer's risk)
Type II error ($\beta$): Rejecting a true Alternative Hypothesis (consumer's risk)
Power of a statistical test, ( 1 - $\beta$ ), is the probability of rejecting the null hypothesis $H_0$ when, in fact, $H_0$ is false.
59
Statistical Tests
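One-tailed tests:
$H_0: \mu = \mu_0$; $H_a: \mu > \mu_0$ (or $H_a: \mu < \mu_0$)
Rejection region: $z > z_{\alpha}$ (or $z < -z_{\alpha}$)
Two-tailed tests:
$H_0: \mu = \mu_0$; $H_a: \mu \ne \mu_0$
Rejection region: $z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$
where $z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$ and $P(z > z_{\alpha}) = \alpha$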
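The rejection rules for one-tailed and two-tailed z tests can be sketched as below. The function name, its `alternative` argument, and the numeric inputs are illustrative assumptions, and sigma is treated as known.

```python
from math import sqrt
from statistics import NormalDist


def z_test(x_bar, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Return the z statistic and whether H0: mu = mu0 is rejected."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":        # Ha: mu > mu0
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":         # Ha: mu < mu0
        reject = z < -std_normal.inv_cdf(1 - alpha)
    else:                               # Ha: mu != mu0
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    return z, reject


# Hypothetical data: sample mean 52 from n = 36, testing mu0 = 50, sigma = 6.
z_value, reject_h0 = z_test(x_bar=52.0, mu0=50.0, sigma=6.0, n=36)
print(z_value, reject_h0)
```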
60
The Critical Value
The sample size for specified $\alpha$ and $\beta$ when testing $H_0: \mu = \mu_0$ versus $H_a: \mu = \mu_a$ is given by
$n = \frac{(z_{\alpha} + z_{\beta})^2\, \sigma^2}{(\mu_a - \mu_0)^2}$
Assumption: $\sigma$ is the same under both hypotheses.
61
The observed significance level for a test
It is standard in industry to use $\alpha$ = 0.05.
Some researchers prefer to report the observed "p-value". This is the probability (under $H_0$) of observing a value of the test statistic at least as extreme as the one obtained. This allows readers to make their own decision about accepting or rejecting $H_0$.
Most computer packages report the significance as (for example) Prob > |T|.
62
Testing proportions (large samples)
$H_0: p = p_0$ ($H_a: p > p_0$)
test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}$
where $\hat{p} = y / n$ is the observed proportion of successes
Rejection region (example): $z > z_{\alpha}$
Assumption: The interval $\hat{p} \pm 2 \sqrt{\hat{p} (1 - \hat{p}) / n}$ does not contain 0 or 1.
63
Testing a Normal Mean
Select $\alpha$. Set your test as one-tailed or two-tailed.
Calculate test statistic: $z = \frac{\bar{y} - \mu_0}{\sigma / \sqrt{n}} \approx \frac{\bar{y} - \mu_0}{S / \sqrt{n}}$
Compare to the critical value (from book's table).
If sample is small ( n < 30 ):
Calculate test statistic: $t = \frac{\bar{y} - \mu_0}{S / \sqrt{n}}$
(assumes an approximately normal population)
64
Testing a variance
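$H_0: \sigma^2 = \sigma_0^2$
test statistic: $\chi^2 = \frac{(n - 1) S^2}{\sigma_0^2}$
Rejection region (critical values with $n - 1$ df):
$\chi^2 > \chi^2_{\alpha}$ for $H_a: \sigma^2 > \sigma_0^2$
$\chi^2 < \chi^2_{1 - \alpha}$ for $H_a: \sigma^2 < \sigma_0^2$
$\chi^2 < \chi^2_{1 - \alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$ for $H_a: \sigma^2 \ne \sigma_0^2$
Assumption: Population is approximately normal.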
65
Testing Differences of Two Means -- large samples --
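$H_0: \mu_1 - \mu_2 = D_0$
test statistic: $z = \frac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
Rejection region:
$z > z_{\alpha}$ if $H_a: \mu_1 - \mu_2 > D_0$
$z < -z_{\alpha}$ if $H_a: \mu_1 - \mu_2 < D_0$
$z > z_{\alpha/2}$ or $z < -z_{\alpha/2}$ if $H_a: \mu_1 - \mu_2 \ne D_0$
Assumptions: Independent samples with more than 30 observations each.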
66
Testing Differences of Two Means -- small samples, same variance --
$H_0: \mu_1 - \mu_2 = D_0$
test statistic: $t = \frac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$
Rejection region (example): $t > t_{\alpha,\, n_1 + n_2 - 2}$ ($H_a: \mu_1 - \mu_2 > D_0$)
where $S_p^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$ ("pooled variance")
Assumptions:
1. Indep. samples from normal populations.
2. Variances are unknown but equal ($\sigma_1^2 = \sigma_2^2$)
67
Testing Differences of Two Means -small samples, different variances-
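$H_0: \mu_1 - \mu_2 = D_0$
test statistic: $t = \frac{(\bar{Y}_1 - \bar{Y}_2) - D_0}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
Rejection region (example): $t > t_{\alpha,\, \nu}$ ($H_a: \mu_1 - \mu_2 > D_0$)
where $\nu = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{(S_1^2 / n_1)^2}{n_1 - 1} + \frac{(S_2^2 / n_2)^2}{n_2 - 1}}$ (round down)
Assumptions: Independent samples taken from approximately normal populations.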
68
Testing Difference of Two Means -- matched pairs --
We have PAIRS of observations related through some common factor ($Y_{1i}$, $Y_{2i}$):
Let $d_i = Y_{1i} - Y_{2i}$, the observed difference for pair i
$H_0: \mu_{\text{diff}} = \mu_1 - \mu_2 = D_0$
test statistic: $t = \frac{\bar{d} - D_0}{S_d / \sqrt{n}}$
Rejection region: $t > t_{\alpha,\, n-1}$ for $H_a: \mu_{\text{diff}} = \mu_1 - \mu_2 > D_0$
where $\bar{d}$ and $S_d$ are the mean and the standard deviation of the n sample differences.
Assumptions: Random observations; the population of paired differences is normally distributed.
69
Testing a ratio of two variances
$H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$ (i.e., $\sigma_1^2 = \sigma_2^2$)
test statistic: F = larger sample variance / smaller sample variance
Rejection region:
$F > F_{\alpha/2}$ for $H_a: \frac{\sigma_1^2}{\sigma_2^2} \ne 1$
$F > F_{\alpha}$ for $H_a: \frac{\sigma_1^2}{\sigma_2^2} > 1$
Assumption: Independent samples from normal populations.
Note: Make sure the df in the numerator are those of the sample with larger variance!
70
Testing (p1 - p2) --- (large samples)
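For large samples ($n_1 \ge 30$ and $n_2 \ge 30$):
$H_0: p_1 - p_2 = D_0$
test statistic: $z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{(\hat{p}_1 - \hat{p}_2)}}$
when $D_0 \ne 0$: $\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\frac{\hat{p}_1 \hat{q}_1}{n_1} + \frac{\hat{p}_2 \hat{q}_2}{n_2}}$
when $D_0 = 0$: $\sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\hat{p}\, \hat{q} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$ and $\hat{p} = \frac{y_1 + y_2}{n_1 + n_2}$
Approximation is good as long as no interval includes 0 or 1.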
71
Categorical Data
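One-way Table: Categories and their frequencies:
Categ.   1    2    ..   k    Total
Freq.    n1   n2   ..   nk   n
Large sample conf. int. for $p_i$: $\hat{p}_i \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_i (1 - \hat{p}_i)}{n}}$
Example:
Categ.   EE   ME   Others   Total
Freq.    17   11   9        37
Then $\hat{p}_{EE} = \frac{17}{37} = 0.46$ and the 95% interval is
$\frac{17}{37} \pm 1.96 \sqrt{\frac{1}{37} \cdot \frac{17}{37} \cdot \frac{20}{37}} = 0.46 \pm 0.16$, i.e. $0.30 \le p_{EE} \le 0.62$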
72
One-way Tables (Cont.)
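Large sample (1 - $\alpha$) 100 % Conf. Int. for $p_i - p_j$:
$(\hat{p}_i - \hat{p}_j) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_i (1 - \hat{p}_i) + \hat{p}_j (1 - \hat{p}_j) + 2 \hat{p}_i \hat{p}_j}{n}}$
In the example:
$\left( \frac{17}{37} - \frac{11}{37} \right) \pm 1.96 \sqrt{\frac{1}{37} \left[ \frac{17}{37} \cdot \frac{20}{37} + \frac{11}{37} \cdot \frac{26}{37} + 2 \cdot \frac{17}{37} \cdot \frac{11}{37} \right]} = 0.162 \pm 0.275$
$-0.113 \le p_{EE} - p_{ME} \le 0.437$: the difference is NOT significant!
NOTE: $-0.045 \le p_{EE} - p_{Others} \le 0.477$, again NOT significant!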
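The interval for a single category proportion and the interval for a difference of two category proportions from the EE / ME / Others example can be recomputed as follows. The counts come from the slides; the function names are introduced here for illustration, and z = 1.96 is the usual 95% value.

```python
from math import sqrt

counts = {"EE": 17, "ME": 11, "Others": 9}  # one-way table from the example
n = sum(counts.values())
z = 1.96  # z_{alpha/2} for a 95% interval


def proportion_ci(category):
    """Large-sample CI for one category proportion."""
    p = counts[category] / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


def difference_ci(cat_i, cat_j):
    """Large-sample CI for p_i - p_j within the same one-way table."""
    pi, pj = counts[cat_i] / n, counts[cat_j] / n
    # The +2*pi*pj term reflects the negative covariance between the two counts.
    half = z * sqrt((pi * (1 - pi) + pj * (1 - pj) + 2 * pi * pj) / n)
    return (pi - pj) - half, (pi - pj) + half


print(proportion_ci("EE"))
print(difference_ci("EE", "ME"))
print(difference_ci("EE", "Others"))
```

Both difference intervals contain 0, which is why neither difference is judged significant.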
73
Categorical Data Analysis
General r x c Contingency Table
         1        2        ..   c        Totals
1        n(1,1)   n(1,2)   ..   n(1,c)   r(1)
2        n(2,1)   n(2,2)   ..   n(2,c)   r(2)
..       ..       ..       ..   ..       ..
r        n(r,1)   n(r,2)   ..   n(r,c)   r(r)
Totals   c(1)     c(2)     ..   c(c)     n
74
Example of a Contingency Table
STA 3032 - Summer 1994
Grade    Q2   Q4   Q6   Total
0-2      13   0    2    15
2.1-4    6    1    1    8
4.1-6    8    5    11   24
6.1-8    4    7    9    20
8.1-10   2    16   6    24
Total    33   29   29   91
75
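Testing for Independence
$H_0$: Variables are independent
$H_a$: They are not
Test statistic: $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} = n \left[ \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{n_{ij}^2}{r_i c_j} - 1 \right]$
where $E(n_{ij}) = \frac{r_i c_j}{n}$
Rejection region: $\chi^2 > \chi^2_{0.05,\, (r-1)(c-1)}$
Note: regroup rows (columns) as needed for $E(n_{ij}) \ge 5$ for all i, j.
In the example: $\chi^2 = 91 \left[ \frac{19^2}{23 \cdot 33} + \frac{1^2}{23 \cdot 29} + \cdots + \frac{6^2}{24 \cdot 29} - 1 \right] = 41.33 > \chi^2_{0.05,\, 6} = 12.5916$
(Note regrouping! Compare to $\chi^2$ from Table)
Conclusion: Variables are NOT independent.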
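The example statistic can be recomputed directly from the grade table. The sketch below assumes the regrouping merges the first two grade rows (0-2 and 2.1-4) into one row of 19, 1, 3, which is what the row total of 23 in the worked formula implies; the function name is illustrative.

```python
# Observed counts after merging grade rows 0-2 and 2.1-4 (assumed regrouping).
observed = [
    [19, 1, 3],    # grades 0-4
    [8, 5, 11],    # grades 4.1-6
    [4, 7, 9],     # grades 6.1-8
    [2, 16, 6],    # grades 8.1-10
]


def chi_square_independence(table):
    """Return the chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E(n_ij) = r_i c_j / n
            stat += (n_ij - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


chi2, dof = chi_square_independence(observed)
print(round(chi2, 2), dof)
```

The statistic is then compared with the tabulated critical value 12.5916 for 6 degrees of freedom at the 0.05 level.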
76
Distributions: Model Fitting Steps
1. Collect data. Make sure you have a random sample. You will need at least 30 valid cases
2. Plot data. Look for familiar patterns
3. Hypothesize several models for distribution
4. Using part of the data, estimate model parameters
5. Using the rest of the data, analyze the model’s accuracy
6. Select the “best” model and implement it
7. Keep track of model accuracy over time. If warranted, go back to 6 (or to 3, if data (population?) behavior keeps changing)
77
Chi-Square Test of Goodness of Fit
$H_0: p_1 = p_{10};\ p_2 = p_{20};\ \ldots;\ p_k = p_{k0}$ with $\sum_{i} p_{i0} = 1$
$H_a$: At least one $p_i \ne p_{i0}$
Let n = sample size and $n_i$ the observed frequency in cell i
Make sure that $e_i = n p_{i0} \ge 5$ (if not, regroup cells as needed).
Test Statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}}$
Rejection Region: $\chi^2 > \chi^2_{\alpha,\, \nu}$
where:
$\nu$ = k - r - 1
k = number of cells after regrouping
r = number of parameters estimated from data to calculate the $p_{i0}$
78
Kolmogorov-Smirnov Test of Goodness of Fit
Compares the empirical distribution function $F_n(y)$ with a hypothesized theoretical distribution function $F(y)$.
Empirical: $F_n(y)$ = fraction of the sample less than or equal to y = $\frac{i}{n}$ for the i-th ranked observation $y_i$
Let $D^+ = \max_i \left[ \frac{i}{n} - F(y_i) \right]$ and $D^- = \max_i \left[ F(y_i) - \frac{i - 1}{n} \right]$
Then $D = \max_y \left| F(y) - F_n(y) \right| = \max(D^+, D^-)$
Critical values given in tables
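The K-S statistic is straightforward to compute from the ranked sample. This sketch uses a standard normal as the hypothesized distribution purely as an example; the function name and the data are made up here, and the result must still be compared with a tabulated critical value.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a hypothesized CDF."""
    ys = sorted(sample)
    n = len(ys)
    # enumerate() is 0-based, so rank i corresponds to index i - 1.
    d_plus = max((idx + 1) / n - cdf(y) for idx, y in enumerate(ys))
    d_minus = max(cdf(y) - idx / n for idx, y in enumerate(ys))
    return max(d_plus, d_minus)


sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]            # hypothetical observations
d_stat = ks_statistic(sample, NormalDist(0, 1).cdf)  # hypothesized N(0, 1)
print(d_stat)
```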
79
A Review of Probability and Statistics
• Descriptive statistics
• Probability
• Random variables
• Sampling distributions
• Estimation and confidence intervals
• Tests of hypotheses
– For means, variances, and proportions
– Goodness of fit