CSS 211: Statistical Methods I
Zhaoxian Zhou
School of Computing, University of Southern Mississippi
January 11, 2018
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 1 / 227
Overview
1 Looking at Data — Distributions
2 Looking at Data — Relationships
3 Producing Data
4 Probability: the Study of Randomness
5 Sampling Distributions
6 Introduction to Inference
7 Inference for Distributions
Objectives
☛ Displaying distributions with graphs: Variables; Types of variables; Graphs for categorical variables (Bar graphs, Pie charts); Graphs for quantitative variables (Histograms, Stemplots, Stemplots versus histograms); Interpreting histograms; Time plots
☛ Describing distributions with numbers: Measures of center (mean, median); Mean versus median; Measures of spread (quartiles, standard deviation); Five-number summary and boxplot; Choosing among summary statistics; Changing the unit of measurement
☛ Density curves and Normal distributions: Density curves; Measuring center and spread for density curves; Normal distributions; The 68-95-99.7 rule; Standardizing observations; Using the standard Normal Table; Inverse Normal calculations; Normal quantile plots
Basic Concepts
☛ Statistics: the science of learning from data.
☛ Cases: the objects described by the set of data. Can be individuals, companies, animals, plants, or any object of interest.
☛ Variable: a characteristic of a case.
☞ Categorical: something that falls into one of several categories. Example: blood type, hair color, first language
☞ Quantitative: something that takes numerical values for which arithmetic operations, such as adding and averaging, make sense. Example: age, height, blood pressure
☞ Choose an appropriate variable that measures what you want it to, e.g., rate or count of occurrences
☛ Label: a special variable used in some data sets to distinguish the different cases.
☛ The distribution of a variable tells us what values the variable takes and how often it takes these values.
Displaying Distribution with Graphs
☛ Ways to chart categorical data
☞ Bar graph: each category is represented by a bar
☞ Pie chart: the slices must represent the parts of one whole.
☛ Ways to chart quantitative data
☞ Stemplot, also called a stem-and-leaf plot. Each observation is split into a stem, consisting of all digits except the final one, and a leaf, the final digit.
☞ Histogram: breaks the range of values of a variable into classes and displays only the count or percent of the observations that fall into each class.
☞ Time plot: plots each observation against the time at which it was measured.
Bar Graphs and Pie Charts for Categorical Variables
☛ Bar graph: each category is represented by a bar
☛ Pie chart: the slices must represent the parts of one whole, and all percents add up to 100.
Stemplots for Quantitative Variables
☛ Leaf: final digit; Stem: all others
☛ Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
☛ Write each leaf in the row to the right of its stem, in increasing order out from the stem.
☛ Example:
Stemplots — Variations
Histograms
☛ good for large data sets
☛ breaks the range of values into classes; shows the number of individual data points that fall in each interval.
☛ Example:
Examining Distributions
☛ Look for overall pattern and for striking deviations from the pattern
☛ Describe the overall pattern by shape, center, and spread.
☛ Look for outliers: individual values that fall outside the overall pattern. A large gap in the distribution is typically a sign of an outlier. Explain any outliers: errors in recording data? Equipment failure?
☛ Modes: major peaks.
☛ A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other. A distribution is skewed to the right if the right side of the histogram extends much farther out than the left side.
Time Plots
☛ Plot of each observation against time at which it was measured
☛ Reveal trends or other changes over time, despite small irregularities.
☛ Time is on the horizontal scale.
Time Plots — An Example
Describing Distributions with Numbers: Measuring Center
☛ The mean or average:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad\text{or}\quad \bar{x} = \frac{1}{n}\sum_i x_i \]
☛ The median M: the midpoint of a distribution, the number such that half of the observations are smaller and half are larger
☞ Sort all observations
☞ Find the midpoint value
If n is odd: \( M = x_{(n+1)/2} \);
If n is even: M is the mean of \( x_{n/2} \) and \( x_{n/2+1} \).
☛ Example?
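The mean and median rules can be sketched in a few lines of Python. The function names and the sample data are illustrative assumptions, not part of the course material.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(values) / len(values)


def median(values):
    """Median by the sort-and-find-the-midpoint rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)-th observation, counting from 1
        return ordered[mid]
    # n even: mean of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [1, 3, 5, 4, 6, 3, 6, 3, 5, 4]  # illustrative data
print(mean(data), median(data))        # 4.0 4.0
```

Replacing one value with an extreme one (for instance `[1, 2, 3, 1000]`) moves the mean a long way while the median barely changes.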
Comparing the Mean and the Median
☛ The median is a measure of center that is resistant to skew and outliers. The mean is not. Example: a Powerball jackpot of $1.5B and its effect on the average and the median household income of the US ($43,585 in 2012).
☛ Symmetric distribution: the mean equals the median.
☛ Skewed distribution: the mean is farther out in the long tail than the median.
Describing Distributions with Numbers: Measuring Spread or Variability
☛ The 50th percentile: the median
☛ The upper quartile: the median of the upper half of the data
☛ The lower quartile: the median of the lower half of the data
☛ The pth percentile of a distribution is the value that has p percent of the observations fall at or below it.
☛ The quartiles Q1 and Q3:
☞ Sort the observations in increasing order.
☞ Find the median M.
☞ Find Q1: the median of the left half of the data
☞ Find Q3: the median of the right half of the data
☛ Interquartile range: IQR = Q3 − Q1
Five-Number Summary and Boxplot
☛ The five-number summary of a set of observations is: Minimum, Q1, M, Q3, Maximum
☛ A boxplot is a graph of the five-number summary.
☛ An example:
Suspected Outlier and Modified Boxplot
☛ The 1.5 × IQR rule: an observation is a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
☛ A modified boxplot: the lines extend out from the center box only to the smallest and largest observations that are not flagged by the 1.5 × IQR rule.
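The five-number summary and the 1.5 × IQR rule can be sketched as follows. The data are hypothetical, and leaving the median out of both halves when n is odd is an assumption about how the "left half" and "right half" are formed; software packages use several slightly different quartile conventions.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations to the left of the median
    upper = ordered[half + n % 2:]   # observations to the right of the median
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def suspected_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


data = [2, 4, 5, 6, 7, 8, 9, 11, 30]   # hypothetical observations
print(five_number_summary(data))        # (2, 4.5, 7, 10.0, 30)
print(suspected_outliers(data))         # [30]
```

In a modified boxplot of these data the upper whisker would stop at 11, and 30 would be plotted as a separate point.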
Standard Deviation
☛ The variance (s²) is defined as
\[ s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2 \]
☛ The standard deviation (s) is defined as
\[ s = \sqrt{\frac{1}{n-1}\sum_i (x_i - \bar{x})^2} \]
☛ The number n − 1 is the degrees of freedom of the variance or standard deviation.
☛ s measures the spread about the mean. s = 0 means there is no spread.
☛ s is not resistant to outliers and has the same units as the original observations.
Example: Calculate the Mean and the Standard Deviation
☛ Given a data set: x = [1, 3, 5, 4, 6, 3, 6, 3, 5, 4]
☛ Calculate the mean:
\[ \bar{x} = \frac{1 + 3 + 5 + 4 + 6 + 3 + 6 + 3 + 5 + 4}{10} = \frac{40}{10} = 4. \]
☛ Calculate the standard deviation
x           1   3   5   4   6   3   6   3   5   4
x − x̄      -3  -1   1   0   2  -1   2  -1   1   0
(x − x̄)²    9   1   1   0   4   1   4   1   1   0
Variance
\[ s^2 = \frac{9 + 1 + 1 + 0 + 4 + 1 + 4 + 1 + 1 + 0}{10 - 1} = \frac{22}{9}. \]
Standard deviation
\[ s = \sqrt{\frac{22}{9}} = 1.5635. \]
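The same calculation can be reproduced step by step in Python. This is a minimal sketch; the variable names are illustrative, and the cross-check against `statistics.stdev` (which also divides by n − 1) is an addition.

```python
import math
import statistics

x = [1, 3, 5, 4, 6, 3, 6, 3, 5, 4]

n = len(x)
x_bar = sum(x) / n                                # mean
squared_dev = [(xi - x_bar) ** 2 for xi in x]     # (x - mean)^2 for each value
variance = sum(squared_dev) / (n - 1)             # divide by n - 1, not n
std_dev = math.sqrt(variance)

print(x_bar)                 # 4.0
print(sum(squared_dev))      # 22.0
print(round(std_dev, 4))     # 1.5635

# The standard library gives the same sample standard deviation.
print(math.isclose(std_dev, statistics.stdev(x)))  # True
```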
Linear Transformation of Data
☛ A linear transformation
\[ x_{\text{new}} = a + bx, \]
where a shifts the values of x upward or downward and b changes the size of the unit of measurement.
☛ A linear transformation does not change the shape of a distribution.
☛ Multiplication only: the measures of center (mean and median) and the measures of spread (interquartile range and standard deviation) are multiplied by b (for b > 0).
☛ Addition only: a is added to the measures of center and to the percentiles; the measures of spread do not change.
Linear Transformation — Example 1
☛ Fahrenheit is a temperature scale on which the freezing point of water is 32 degrees Fahrenheit and the boiling point is 212 degrees at standard atmospheric pressure.
☛ Celsius and Fahrenheit scales are related by
\[ {}^{\circ}\mathrm{F} = 32 + 1.8 \times {}^{\circ}\mathrm{C}. \]
☛ An example of distributions of the same temperature set:
[Figure: density curves of the same set of temperatures, one labeled C and one labeled F; horizontal axis from −40 to 80]
Linear Transformation — Example 2
☛ Curved scores y and raw scores x are related by
\[ y = 1.2x + 20. \]
Suppose the mean and the standard deviation of the raw scores are
\[ \bar{x} = 50, \quad \sigma_x = 10. \]
☛ The mean and standard deviation of the curved scores are
\[ \bar{y} = ??; \quad \sigma_y = ?? \]
☛ How about the median and interquartile range of the curved scores?
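The linear-transformation rules can be checked numerically. In this sketch the raw scores are hypothetical, chosen only to exercise the rules; the last two lines apply the same rules to the summary values given in the exercise.

```python
import math
import statistics

a, b = 20, 1.2                        # the curve y = 1.2x + 20
raw = [35, 42, 50, 50, 58, 65]        # hypothetical raw scores
curved = [a + b * x for x in raw]

# Measures of center are shifted by a and scaled by b.
mean_ok = math.isclose(statistics.mean(curved), a + b * statistics.mean(raw))
median_ok = math.isclose(statistics.median(curved), a + b * statistics.median(raw))
# Measures of spread are only scaled by b; the shift a has no effect.
sd_ok = math.isclose(statistics.stdev(curved), b * statistics.stdev(raw))
print(mean_ok, median_ok, sd_ok)      # True True True

# The same rules applied to a raw mean of 50 and standard deviation of 10:
mean_y = a + b * 50                   # 80.0
sd_y = b * 10                         # 12.0
```

The median follows the same rule as the mean, and the interquartile range follows the same rule as the standard deviation.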
Density Curves
☛ A density curve: a smooth curve fitted to the data, a mathematical model of a distribution.
☛ The area between the curve and the horizontal axis is 1.
☛ The area under the curve and above any range of values is the proportion of all observations that fall in that range.
☛ Example:
Center of Density Curves
☛ The median of a density curve is the equal-areas point: the point that divides the area under the curve in half.
☛ The mean of a density curve is the balance point: the point at which the curve would balance if made of solid material.
☛ The median and mean are the same for a symmetric density curve; the mean of a skewed curve is pulled in the direction of the long tail.
☛ Example:
Normal Curves
☛ The Normal (Gaussian) distribution: symmetric, unimodal, and bell-shaped
☛ All Normal curves N(µ, σ) have the same overall shape: the height of the curve at point x is given by
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}. \]
☛ Example:
The 68-95-99.7 Rule
In a Normal distribution with mean µ and standard deviation σ,
☛ Approximately 68% of the observations fall within σ of the mean µ.
☛ Approximately 95% of the observations fall within 2σ of the mean µ.
☛ Approximately 99.7% of the observations fall within 3σ of the mean µ.
☛ Example:
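The 68-95-99.7 rule can be checked against exact Normal areas with `NormalDist` from Python's standard `statistics` module (Python 3.8 or later). The helper name `within` and the values µ = 100, σ = 15 are illustrative assumptions; any µ and σ give the same proportions.

```python
from statistics import NormalDist


def within(k, mu=100.0, sigma=15.0):
    """Proportion of a Normal(mu, sigma) distribution within k sigma of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```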
The Standard Normal Curves
Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve N(µ, σ) into the standard Normal curve N(0, 1).
Standardizing and z-Scores
☛ All Normal distributions N(µ, σ) are the same if measured in units of size σ about the mean µ.
☛ The standard Normal distribution is the Normal distribution N(0, 1).
☛ z-score: the standardized value of x is
\[ z = \frac{x - \mu}{\sigma}, \]
which can be positive, zero, or negative.
☛ Example: The heights of young women are Normal with µ = 64.5 inches and σ = 2.5 inches.
☞ The standard z-score is
\[ z = \frac{x - 64.5}{2.5}. \]
☞ For height x = 68 inches, the z-score is 1.4
☞ For height x = 60 inches, the z-score is −1.8
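The standardization in this example is a one-line function. The name `z_score` is an illustrative choice.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# Heights of young women: mu = 64.5 inches, sigma = 2.5 inches
print(z_score(68, 64.5, 2.5))    # 1.4
print(z_score(60, 64.5, 2.5))    # -1.8
print(z_score(64.5, 64.5, 2.5))  # 0.0, an observation at the mean
```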
Normal Distributions
☛ Cumulative proportion: the proportion of observations in a distribution that lie at or below a given value.
☛ When the distribution is given by a density curve, the cumulative proportion is the area under the curve to the left of a given value.
☛ Example:
Using the Standard Normal Table
Normal Distribution Calculations — Illustrations
Normal Distribution Calculations — Example
☛ Question: SAT scores follow the Normal distribution N(1026, 209). What proportion of students score in the range (720, 820)?
☛ Solution:
☞ Step 1: standardize
\[ 720 \le x \le 820 \implies \frac{720 - 1026}{209} \le Z \le \frac{820 - 1026}{209} \]
or
\[ -1.46 \le Z \le -0.99 \]
☞ Step 2: use the standard Normal table
\[ 0.1611 - 0.0721 = 0.0890 \approx 9\%. \]
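The table lookup can be cross-checked in software. This sketch uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative. Rounding z to two decimals mimics the printed table, while the unrounded calculation gives a slightly different area.

```python
from statistics import NormalDist

mu, sigma = 1026, 209

# Table-style calculation: standardize, round z to two decimals, look up areas.
z_low = round((720 - mu) / sigma, 2)     # -1.46
z_high = round((820 - mu) / sigma, 2)    # -0.99
standard = NormalDist()                  # N(0, 1)
table_style = standard.cdf(z_high) - standard.cdf(z_low)
print(round(table_style, 3))             # 0.089

# Without rounding z, working directly with N(1026, 209).
sat = NormalDist(mu, sigma)
exact = sat.cdf(820) - sat.cdf(720)
print(round(exact, 2))                   # 0.09
```

Both routes round to about 9% of students.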
Normal Distribution Calculations — Inverse Problem
☛ Question: SAT scores follow N(505, 110). What score places a student in the top 10%?
☛ Solution:
☞ Step 1: top 10% means higher than 90% of the scores.
☞ Step 2: use the standard Normal table to find that for cumulative proportion 0.9, the corresponding z is 1.28.
☞ Step 3: unstandardize:
\[ \frac{x - 505}{110} = 1.28 \implies x = 505 + 1.28 \times 110 = 645.8 \]
☞ The general rule for unstandardizing is
\[ x = \mu + z\sigma. \]
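The inverse calculation can be sketched with `NormalDist.inv_cdf` from Python's standard `statistics` module. The variable names are illustrative; the small gap between the two answers comes from rounding z to 1.28 in the table.

```python
from statistics import NormalDist

mu, sigma = 505, 110

# Step 2: the z-value with cumulative proportion 0.90 under N(0, 1)
z = NormalDist().inv_cdf(0.90)
print(round(z, 2))                  # 1.28

# Step 3: unstandardize with x = mu + z * sigma
x_table = mu + 1.28 * sigma         # table value of z, about 645.8
x_exact = NormalDist(mu, sigma).inv_cdf(0.90)
print(round(x_table, 1), round(x_exact, 1))
```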
Overview
1 Looking at Data — Distributions
2 Looking at Data — Relationships
3 Producing Data
4 Probability: the Study of Randomness
5 Sampling Distributions
6 Introduction to Inference
7 Inference for Distributions
Objectives
☛ Scatterplots: Scatterplots; Explanatory and response variables; Interpreting scatterplots; Outliers; Categorical variables in scatterplots; Scatterplot smoothers
☛ Correlation: The correlation coefficient; Influential points
☛ Least-squares regression: Regression lines; Prediction and Extrapolation; Correlation and r²
☛ Cautions about correlation and regression: Residuals; Outliers and influential points; Lurking variables; Correlation/regression using averages
☛ Data analysis for two-way tables: Two-way tables; Joint distributions; Marginal distributions; Conditional distributions; Simpson’s paradox
Introduction
☛ In the last chapter we discussed the distribution of a single variable.
☛ In this chapter we will discuss relationships between pairs of variables.
☛ Are both variables quantitative? If yes, we use scatterplots for graphical display.
☛ Are the two variables associated? If yes, we examine correlation (calculate the correlation coefficient).
☛ For mathematical modelling: we use least-squares regression to fit the data.
☛ Are both variables categorical? If yes, we do data analysis for two-way tables.
Examining Relationships
Most statistical studies involve more than one variable.
Questions:
☛ What cases do the data describe?
☛ What variables are present and how are they measured?
☛ Are all of the variables quantitative?
☛ Do some of the variables explain or even cause changes in other variables?
Looking at Relationships
☛ Start with a graph
☛ Look for an overall pattern and deviations from the pattern
☛ Use numerical descriptions of the data and overall pattern (if appropriate)
☛ Example:
Scatterplots
☛ Each axis represents one of the variables, and the data are plotted as points on the graph.
☛ Describe the relationship by examining the form, direction, and strength of the association.
☞ Form: linear, curved, clusters, no pattern
☞ Direction: positive, negative, no direction
☞ Strength: how closely the points fit the form
☞ Deviations from that pattern: outliers
☛ Example:
Form of an Association
Direction of an Association
Strength of an Association
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
Outliers
An outlier is a data value that has a very low probability of occurrence(i.e., it is unusual or unexpected).
Example
Scatterplot Smoothers
When an association is more complex than linear, we can still describe the overall pattern by smoothing the scatterplot.
☛ Simply average the y values separately for each x value
☛ When a data set does not have many y values for a given x, software smoothers form an overall pattern by looking at the y values for points in the neighborhood of each x value. Smoothers are resistant to outliers.
Correlation Coefficient
☛ Definition
\[ r = \frac{1}{n-1}\sum_{i=1}^{n} \underbrace{\left(\frac{x_i - \bar{x}}{\sigma_x}\right)}_{z\text{-score for } x_i} \times \underbrace{\left(\frac{y_i - \bar{y}}{\sigma_y}\right)}_{z\text{-score for } y_i}. \]
Correlation Coefficient — Example
☛ Given the data set (1,2), (2,5), (3,6), (4,8), (5,11), (6,12), (7,16), (8,18), (9,19), (10,20), which has n = 10 pairs of data
☛ Step 1: Calculate the means and standard deviations for x and y.
Mean for the x data,
\[ \bar{x} = \frac{1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10}{10} = 5.5; \]
Similarly, the mean for the y data is \( \bar{y} = 11.7 \).
Standard deviation for the x data,
\[ \sigma_x = \sqrt{\frac{1}{10-1}\left[(1-5.5)^2 + \cdots + (10-5.5)^2\right]} = 3.0277; \]
Similarly for the y data, \( \sigma_y = 6.3779 \).
Example — Cont’d
☛ Step 2: Calculate the z-scores of x_i and y_i, and the product of the z-scores for each data point:
Point 1: \( z_x = \frac{1-5.5}{3.0277} = -1.4863;\quad z_y = \frac{2-11.7}{6.3779} = -1.5209;\quad z_x z_y = 2.2605 \)
Point 2: \( z_x = \frac{2-5.5}{3.0277} = -1.1560;\quad z_y = \frac{5-11.7}{6.3779} = -1.0505;\quad z_x z_y = 1.2144 \)
Continue until the last point.
Point 10: \( z_x = \frac{10-5.5}{3.0277} = 1.4863;\quad z_y = \frac{20-11.7}{6.3779} = 1.3014;\quad z_x z_y = 1.9342 \)
☛ Step 3: Calculate the correlation coefficient by adding all products and dividing by n − 1:
\[ r = \frac{2.2605 + 1.2144 + \cdots + 1.9342}{10 - 1} = 0.9926 \]
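The three steps can be reproduced in Python. This is a minimal sketch of the z-score definition of r; the function names are illustrative.

```python
import math

pairs = [(1, 2), (2, 5), (3, 6), (4, 8), (5, 11),
         (6, 12), (7, 16), (8, 18), (9, 19), (10, 20)]


def sample_sd(values):
    """Standard deviation with the n - 1 divisor."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(pairs):
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    # Step 1: means and standard deviations
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sx, sy = sample_sd(xs), sample_sd(ys)
    # Step 2: product of the two z-scores for each point
    products = [((x - x_bar) / sx) * ((y - y_bar) / sy) for x, y in pairs]
    # Step 3: add the products and divide by n - 1
    return sum(products) / (n - 1)


r = correlation(pairs)
print(round(r, 4))   # 0.9926
```

Swapping x and y in every pair leaves r unchanged, which is the symmetry property of the correlation coefficient.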
Correlation Coefficient Properties
☛ The correlation coefficient is a measure of the direction (sign) and strength (absolute value) of a linear relationship.
☛ It is calculated using the mean and the standard deviation of both the x and y variables.
☛ Correlation can only be used to describe quantitative variables. Categorical variables do not have means and standard deviations.
☛ The correlation coefficient treats x and y symmetrically; it does not distinguish between x and y.
☛ The correlation coefficient is unitless.
☛ Being unitless, it allows us to compare correlations between data sets where variables are measured in different units or when variables are different.
☛ Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.
Range of Correlation Coefficient
Regression Line
☛ Correlation tells us about strength (scatter) and direction of the linear relationship between two quantitative variables.
☛ We would like to have a numerical description of how both variables vary together.
☛ A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
☛ Use a regression line to predict the value of y for a given value of x. How does y change as x changes?
☛ In regression, the distinction between explanatory and response variables is important.
Concept of Least-Squares Regression
☛ There are many lines that fit the data. But which line best describes the data?
☛ The least-squares regression line is the unique line such that the sum of the squared vertical (y) distances between the data points and the line is as small as possible (least). Exceptionally helpful in statistics.
Least-Squares Regression Line
☛ The least-squares regression line is given by
\[ \hat{y} = b_0 + b_1 x, \]
where
\[ b_1 = r\frac{\sigma_y}{\sigma_x} = \frac{\frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum (x_i - \bar{x})^2} = \frac{\sigma_{xy}}{\sigma_x^2}; \quad b_0 = \bar{y} - b_1\bar{x}. \]
r is the correlation, σ_x and σ_y are the standard deviations, σ_xy is the sample covariance of x and y, and x̄ and ȳ are the means.
Example — Least-Square Regression Line
☛ Data set (1,2), (2,5), (3,6), (4,8), (5,11), (6,12), (7,16), (8,18), (9,19), (10,20)
☛ Step 1: Calculate (in previous example)
\[ \bar{x} = 5.5;\quad \bar{y} = 11.7;\quad \sigma_x = 3.0277;\quad \sigma_y = 6.3779;\quad r = 0.9926. \]
☛ Step 2: Calculate the slope of the least-squares regression line:
\[ b_1 = r\frac{\sigma_y}{\sigma_x} = 0.9926 \times \frac{6.3779}{3.0277} = 2.0909. \]
☛ Step 3: Calculate the y-intercept of the least-squares regression line:
\[ b_0 = \bar{y} - b_1\bar{x} = 11.7 - 2.0909 \times 5.5 = 0.2000. \]
[Figure: scatterplot of the data (y versus x) with the least-squares regression line y = 2.0909x + 0.2000]
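The slope and intercept can also be computed directly from the data. This sketch uses the sums-of-products form of the slope, which is algebraically the same as r·σ_y/σ_x; the function names are illustrative.

```python
pairs = [(1, 2), (2, 5), (3, 6), (4, 8), (5, 11),
         (6, 12), (7, 16), (8, 18), (9, 19), (10, 20)]


def least_squares(pairs):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


def predict(x, b0, b1):
    return b0 + b1 * x


b0, b1 = least_squares(pairs)
print(round(b1, 4), round(b0, 4))        # 2.0909 0.2
print(round(predict(5.5, b0, b1), 4))    # 11.7, the mean of y
```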
Making Predictions — Interpolation
☛ Interpolation: the equation of the least-squares regression line allows you to predict y for any x within the range studied.
☛ Example: nobody in the study drank 6.5 beers, but by finding the value of ŷ from the regression line for x = 6.5 we would expect a blood alcohol content of
\[ \hat{y} = 0.0144 \times 6.5 + 0.0008 = 0.094 \text{ mg/ml}. \]
Making Predictions — Extrapolation
☛ Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.
☛ Extrapolation can be wrong: the linear pattern seen in the data may not continue outside the observed range of x.
Correlation and Regression
☛ Correlation measures the spread (scatter) in both the x and y directions in the linear relationship. It is computed from products of z-scores.
☛ Regression examines the variation in the response variable (y) given a change in the explanatory variable (x). If the y-intercept is zero, recall that
\[ b_1 = r\frac{\sigma_y}{\sigma_x} \quad\text{and}\quad \hat{y} = b_1 x \implies \frac{\hat{y}}{\sigma_y} = r\frac{x}{\sigma_x}. \]
Coefficient of Determination
☛ Coefficient of determination (r²): the square of the correlation coefficient.
☛ r² represents the proportion of the variance in y that can be explained by changes in x, or
\[ r^2 = \frac{\text{variance of the predicted values } \hat{y}}{\text{variance of the observed values } y}. \]
Coefficient of Determination — Example
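☛ Assume the first data set is (1,1), (2,2), (3,3); then
\[ \bar{x} = 2,\ \sigma_x = 1;\quad z_x = -1, 0, 1;\quad \bar{y} = 2,\ \sigma_y = 1;\quad z_y = -1, 0, 1;\quad r = 1. \]
\[ b_1 = r\frac{\sigma_y}{\sigma_x} = 1;\quad b_0 = \bar{y} - b_1\bar{x} = 0 \implies \text{LS regression line: } \hat{y} = x. \]
☛ Assume the second data set is (1,1), (2,5), (3,3); then
\[ \bar{x} = 2,\ \sigma_x = 1;\quad z_x = -1, 0, 1;\quad \bar{y} = 3,\ \sigma_y = 2;\quad z_y = -1, 1, 0;\quad r = 0.5. \]
\[ b_1 = r\frac{\sigma_y}{\sigma_x} = 1;\quad b_0 = \bar{y} - b_1\bar{x} = 1 \implies \text{LS regression line: } \hat{y} = x + 1. \]
☛ For the second data set, the predicted values are 2, 3, 4.
\[ \sigma^2_{\text{pred}} = \frac{(2-3)^2 + (3-3)^2 + (4-3)^2}{3-1} = 1;\quad \sigma^2_{\text{obsv}} = 4;\quad r^2 = 0.25. \]
☛ How about the first data set?
[Figure: two scatterplots, one per data set, each with its least-squares line; x axis from 0 to 4, y axis from −1 to 6]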
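The ratio-of-variances reading of r² can be checked on both data sets. This is a minimal sketch; the function names `fit` and `r_squared` are illustrative.

```python
import statistics


def fit(xs, ys):
    """Least-squares intercept and slope."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(xs, ys):
    """Variance of the predicted values divided by variance of the observed values."""
    b0, b1 = fit(xs, ys)
    predicted = [b0 + b1 * x for x in xs]
    return statistics.variance(predicted) / statistics.variance(ys)


print(r_squared([1, 2, 3], [1, 5, 3]))   # 0.25  (second data set, r = 0.5)
print(r_squared([1, 2, 3], [1, 2, 3]))   # 1.0   (first data set, r = 1)
```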
Coefficient of Determination — Examples
Residuals and Residual Plots
☛ Residual: the vertical distance from a data point to the least-squares regression line (observed y minus predicted ŷ).
☛ It is the contribution of an individual data point to the overall pattern of scatter.
☛ Residuals that are scattered randomly around 0, roughly Normally distributed and with no outliers, indicate that a linear model fits the data.
Examples of Residual Plots
Outliers and Influential Points
☛ Outlier: an observation that lies outside the overall pattern of observations.
☛ Influential individual: an observation that markedly changes the regression if removed. This is often an outlier in the x direction.
Two-Way Tables
☛ Two-way tables: organize data about two categorical variables obtained from a two-way, or block, design. (There are now two ways to group the data.)
☛ We call education the row variable and age group the column variable.
☛ Each combination of values for these two variables is called a cell.
☛ For each cell, we can compute a proportion by dividing the cell entry by the total sample size. The collection of these proportions is the joint distribution of the two variables.
Marginal Distributions
☛ The marginal distribution, expressed in counts or percentages, is the distribution of a single categorical variable in a two-way table.
☛ The marginal distributions can be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, completely ignoring the second one.
Conditional Distributions
☛ The conditional distribution: the distribution of one variable given a particular value of the other variable.
☛ The conditional distributions can be graphically compared using side-by-side bar graphs of one variable for each value of the other variable.
☛ Example:
Conditional vs. Marginal Distributions
☛ A conditional distribution is a probability distribution for a sub-population. In other words, it shows the probability that a randomly selected item in a sub-population has a characteristic you’re interested in.
☛ Marginal distributions are the totals for the probabilities. They are found in the margins.
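Joint, marginal, and conditional distributions can all be computed from the same table of counts. The counts below are hypothetical, as are the category labels; only the choice of education as the row variable and age group as the column variable follows the slides.

```python
# Hypothetical counts: rows are education levels, columns are age groups.
counts = {
    "No college": {"25-34": 20, "35-54": 30},
    "College":    {"25-34": 30, "35-54": 20},
}
columns = list(next(iter(counts.values())))
total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell divided by the total sample size.
joint = {r: {c: n / total for c, n in row.items()} for r, row in counts.items()}

# Marginal distributions: row totals and column totals divided by the total.
row_marginal = {r: sum(row.values()) / total for r, row in counts.items()}
col_marginal = {c: sum(counts[r][c] for r in counts) / total for c in columns}


def conditional_given_column(col):
    """Distribution of the row variable within one column (a sub-population)."""
    col_total = sum(counts[r][col] for r in counts)
    return {r: counts[r][col] / col_total for r in counts}


print(joint["College"]["25-34"])           # 0.3
print(row_marginal["College"])             # 0.5
print(conditional_given_column("25-34"))   # {'No college': 0.4, 'College': 0.6}
```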
Simpson’s Paradox
☛ Simpson’s paradox: an association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group, usually because of a lurking variable, that is, a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
☛ Example:
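A small numerical sketch shows how the reversal can happen. All counts here are hypothetical, invented only to illustrate the effect; the lurking variable is the severity of the case.

```python
# (successes, patients) for treatments A and B in two groups; hypothetical counts.
results = {
    "mild cases":   {"A": (9, 10),   "B": (80, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, patients):
    return successes / patients


# Within each group, treatment A has the higher success rate.
for group, arms in results.items():
    print(group, {t: round(rate(*counts), 2) for t, counts in arms.items()})

# Combining the groups reverses the comparison, because A was used
# mostly on severe cases and B mostly on mild cases.
combined = {}
for t in ("A", "B"):
    successes = sum(results[g][t][0] for g in results)
    patients = sum(results[g][t][1] for g in results)
    combined[t] = rate(successes, patients)
print("combined", {t: round(v, 2) for t, v in combined.items()})
```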
Overview
1 Looking at Data — Distributions
2 Looking at Data — Relationships
3 Producing Data
4 Probability: the Study of Randomness
5 Sampling Distributions
6 Introduction to Inference
7 Inference for Distributions
Objectives
☛ Design of experiments: Anecdotal and available data; Comparative experiments; Randomization; Randomized comparative experiments; Cautions about experimentation; Matched pairs designs; Block designs
☛ Sampling designs; Toward statistical inference: Sampling methods; Simple random samples; Stratified samples; Caution about sampling surveys; Population versus sample; Toward statistical inference; Sampling variability; Capture-recapture sampling
☛ Ethics: Institutional review boards; Informed consent; Confidentiality; Clinical trials; Behavioral and social science experiments
Obtaining Data
☛ Available data: data that were produced in the past for some other purpose but that may help answer a present question inexpensively. The library and the Internet are sources of available data.
☛ Anecdotal evidence: based on haphazardly selected individual cases, which we tend to remember because they are unusual in some way. They also may not be representative of any larger group of cases. Example: In 2013 in the US, the one-year odds of death from all motor vehicle accidents were one in 8,938; the one-year odds of death from lightning were one in 13,744,732. The odds of winning the Powerball jackpot are one in 292 million.
☛ Some questions require data produced specifically to answer them. This leads to designing observational or experimental studies.
Basic Concepts
☛ Population: the entire group of individuals in which we are interested but can’t usually assess directly. Example: how long can LED bulbs last?
☛ Sample: the part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design.
☛ A parameter: a number describing a characteristic of the population.
☛ A statistic: a number describing a characteristic of a sample.
☛ Observational study: record data on individuals without attempting to influence the responses. Example: what is the average life span of the items in the sample?
☛ Experimental study: deliberately impose a treatment on individuals and record their responses. Influential factors can be controlled. Example: what is the effect of a drug on a disease?
Observational Studies vs. Experiments
☛ Observational studies are essential sources of data on a variety of topics. However, when our goal is to understand cause and effect, experiments are the only source of fully convincing data.
☛ Two variables are confounded when their effects on a response variable cannot be distinguished from each other.
☛ Example: If we simply observe cell phone use and brain cancer, any effect of radiation on the occurrence of brain cancer is confounded with lurking variables such as age, occupation, and place of residence.
☛ Well-designed experiments take steps to defeat confounding.
Terminology
☛ The individuals in an experiment are the experimental units. If they are human, we call them subjects.
☛ In an experiment, we do something to the subject and measure the response. The “something” we do is called a treatment, or factor.
☛ If the experiment involves giving two different doses of a drug, we say that we are testing two levels of the factor.
☛ A response to a treatment is statistically significant if it is larger than you would expect by chance (due to random variation among the subjects).
Comparative Experiments
☛ Experiments are comparative in nature: we compare the response to a treatment to
☞ another treatment
☞ no treatment (a control)
☞ a placebo
☞ or any combination of the above
☛ A control is a situation where no treatment is administered. It serves as a reference mark for an actual treatment (e.g., a group of subjects does not receive any drug or pill of any kind).
☛ A placebo is a fake treatment, such as a sugar pill. This is to test the hypothesis that the response to the actual treatment is due to the actual treatment and not the subject’s apparent treatment. The “placebo effect” is an improvement in health not due to any treatment, but only to the patient’s belief that he or she will improve.
Designing Controlled Experiments
☛ Example: an experiment to evaluate the success of various fertilizer treatments was worthless because of poor experimental design:
☞ Fertilizer had been applied to a field one year and not another, in order to compare the yield of grain produced in the two years
☞ Fertilizer was applied to one field and not to a nearby field in the same year.
☛ Fisher’s solution: randomized comparative experiments
Randomization
☛ One way to randomize an experiment is to rely on random digits to make choices in a neutral way. We can use a table of random digits or the random sampling function of statistical software.
☛ Randomly choose n individuals from a group of N
☞ Label each of the N individuals with a number (typically from 1 to N, or 0 to N − 1).
☞ A list of random digits is parsed into groups of digits the same length as N.
☞ The parsed list is read in sequence and the first n groups corresponding to a label in our group of N are selected.
☞ The n individuals with these labels constitute the selection.
Randomization — Example
Problem: randomly select five students from a class of 20.
☛ List and number all students as 01, 02, ..., 20.
☛ The number 20 is two digits long, so parse the list of random digits into numbers that are two digits long. Here we chose to start with line 103 for no particular reason.
☛ Randomly choose five students by reading through the list of two-digit random numbers, starting with line 103 and continuing on.
☛ The first five random numbers that match the numbers assigned to students make our selection.
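The random-digit procedure can be imitated in software. In this sketch a pseudorandom generator stands in for the printed table of random digits, the function name and the seed are arbitrary choices, and skipping a label that has already been chosen is an assumption needed to end up with five different students.

```python
import random


def select_by_random_digits(n_select, n_total, rng):
    """Pick n_select distinct labels from 1..n_total by reading random digit groups."""
    width = len(str(n_total))            # 20 -> two-digit labels 01, 02, ..., 20
    chosen = []
    while len(chosen) < n_select:
        # Read the next group of `width` random digits, e.g. "0", "7" -> 07
        group = "".join(str(rng.randrange(10)) for _ in range(width))
        label = int(group)
        # Ignore groups that match no student or repeat an earlier choice
        if 1 <= label <= n_total and label not in chosen:
            chosen.append(label)
    return chosen


rng = random.Random(103)                 # arbitrary seed, for repeatability
students = select_by_random_digits(5, 20, rng)
print(students)
```

In practice `random.sample(range(1, 21), 5)` achieves the same kind of selection in one call.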
Principles of Experimental Design
☛ Three big ideas of experimental design:
☞ Control the effects of lurking variables on the response, simply by comparing two or more treatments.
☞ Randomize: use impersonal chance to assign subjects to treatments.
☞ Replicate each treatment on enough subjects to reduce chance variation in the results.
☛ Statistical significance: an observed effect so large that it would rarely occur by chance is called statistically significant.
☛ Completely randomized experimental designs: individuals are randomly assigned to groups, then the groups are randomly assigned to treatments.
Biased Design
The design of a study is biased if it systematically favors certain outcomes.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 80 / 227
Caution about Experimentation
☛ Ways to remove bias:
☞ Randomize the design: both the individuals and treatments are assigned randomly.
☞ A double-blind experiment: neither the subjects nor the experimenter knows which individuals got which treatment until the experiment is completed. The goal is to avoid forms of placebo effects and biases based on interpretation.
☞ Replicate your experiment: this ensures that particular results are not due to uncontrolled factors or errors of manipulation.
☛ Lack of realism is a serious weakness of experimentation. The subjects or treatments or setting of an experiment may not realistically duplicate the conditions we really want to study. In that case, we cannot generalize the conclusions of the experiment. Example: studying the effects of hair spray on rats to determine what will happen to women with big hair.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 81 / 227
Block Designs
In a block, or stratified, design, subjects are divided into groups, or blocks, prior to experiments, to test hypotheses about differences between the groups.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 82 / 227
Matched Pairs Designs
☛ In a matched pairs design, choose pairs of subjects that are closely matched, e.g., same sex, height, weight, age, and race. Within each pair, randomly assign who will receive which treatment.
☛ It is also possible to just use a single person, and give the two treatments to this person over time in random order. In this case, the “matched pair” is just the same person at different points in time.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 83 / 227
Sampling Methods
☛ Convenience sampling: just ask whoever is around. Bias: opinions are limited to the individuals present.
☛ Voluntary response sampling: often presented as public opinion polls, these are not considered valid or scientific, because people differ in how motivated they are to respond.
☛ Probability or random sampling: individuals are randomly selected. No one group should be over-represented. Sampling randomly removes selection bias.
☛ A simple random sample (SRS) is made of randomly selected individuals. Each individual in the population has the same probability of being in the sample. All possible samples of size n have the same chance of being drawn.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 84 / 227
Sampling Methods — Stratified Samples
☛ A stratified random sample is essentially a series of SRSs performed on subgroups of a given population.
☛ The subgroups are chosen to contain all the individuals with a certain characteristic. Examples:
☞ Divide the population of USM students into males and females.
☞ Divide the population of California by major ethnic group.
☛ The SRSs taken within each group in a stratified random sample need not be of the same size. For example:
☞ A stratified random sample of 100 male and 150 female USM students
☞ A stratified random sample of a total of 100 Californians, representing proportionately the major ethnic groups
☛ Multistage samples use multiple stages of stratification.
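A stratified random sample can be sketched as one SRS per stratum. The helper name `stratified_sample` and the population of ID numbers below are hypothetical; only the sample sizes (100 and 150) come from the example above.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw an SRS of the requested size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, sizes[name]) for name, members in strata.items()}

# Hypothetical population: ID numbers split into two strata.
strata = {"male": list(range(0, 4000)), "female": list(range(4000, 10000))}
sample = stratified_sample(strata, {"male": 100, "female": 150}, seed=1)
print(len(sample["male"]), len(sample["female"]))
```

Each stratum is sampled independently, so the sizes can be set separately or in proportion to the stratum sizes.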
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 85 / 227
Caution about Sampling Surveys
☛ Nonresponse: people who feel they have something to hide or who don’t like their privacy being invaded probably won’t answer. Yet they are part of the population.
☛ Response bias: fancy term for lying when you think you should not tell the truth, or forgetting. This is particularly important when the questions are very personal (e.g., “How much do you drink?”) or related to the past.
☛ Wording effects: questions worded like “Do you agree that it is awful that · · · ” are prompting you to give a particular response.
☛ Undercoverage: occurs when parts of the population are left out in the process of choosing the sample.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 86 / 227
Toward Statistical Inference
The techniques of inferential statistics allow us to draw inferences or conclusions about a population from a sample.
☛ Your estimate of the population is only as good as your sampling design. So work hard to eliminate biases.
☛ Your sample is only an estimate, and if you randomly sampled again you would probably get a somewhat different result.
☛ The bigger the sample the better.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 87 / 227
Sampling Variability
☛ Each time we take a random sample from a population, we are likely to get a different set of individuals and a different statistic. This is called sampling variability.
☛ The good news is that, if we take lots of random samples of the same size from a given population, the variation from sample to sample (the sampling distribution) will follow a predictable pattern. All of statistical inference is based on this knowledge.
☛ The variability of a statistic is described by the spread of its sampling distribution. This spread depends on the sampling design and the sample size n, with larger sample sizes leading to lower variability. Statistics from large samples are almost always close estimates of the true population parameter. However, this only applies to random samples.
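A small simulation can illustrate how the spread of a statistic shrinks as n grows. Everything here is an assumption for illustration: the population is a hypothetical list of 10,000 simulated values, and the function name `spread_of_sample_means` is made up.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats, rng):
    """Standard deviation of the sample mean over many SRSs of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

rng = random.Random(0)
# Hypothetical population of 10,000 values.
population = [rng.gauss(50, 10) for _ in range(10_000)]
small = spread_of_sample_means(population, 10, 500, rng)    # samples of size 10
large = spread_of_sample_means(population, 400, 500, rng)   # samples of size 400
print(small, large)
```

The second number should come out clearly smaller than the first, in line with the statement that larger samples give lower variability.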
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 88 / 227
Sampling Variability — Cont’d
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 89 / 227
Capture — Recapture Sampling
☛ Repeated sampling can be used to estimate the size N of a population.
☛ Example: What is the number of a bird species (least flycatcher) migrating along a major route? Least flycatchers are caught in nets, tagged, and released. The following year, the birds are caught again and the numbers tagged versus not tagged are recorded.
☛ Solution: the proportion of tagged birds in the sample should be a reasonable estimate of the proportion of tagged birds in the population.
☛ This works well if both samples are SRSs from the population and the population remains unchanged between samples. In practice, however, some of the birds tagged last year died before this year’s migration.
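Equating the two proportions gives an estimate of N: (tagged in second sample) / (size of second sample) ≈ (number tagged in first year) / N. A minimal sketch follows; the function name and all counts are hypothetical, since the slides give no numbers.

```python
def estimate_population(tagged_first, caught_second, tagged_in_second):
    """Estimate N from: tagged_in_second / caught_second ~ tagged_first / N."""
    if tagged_in_second == 0:
        raise ValueError("no tagged animals recaptured; N cannot be estimated")
    return tagged_first * caught_second / tagged_in_second

# Hypothetical counts, for illustration only.
print(estimate_population(tagged_first=200, caught_second=120, tagged_in_second=12))
```

With these made-up counts, 10% of the second sample is tagged, so the 200 tagged birds are taken to be about 10% of the population.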
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 90 / 227
Ethics — Institutional Review Boards
☛ The organization that carries out the study must have an institutional review board that reviews all planned studies in advance in order to protect the subjects from possible harm.
☛ The purpose of an institutional review board is “to protect the rights and welfare of human subjects (including patients) recruited to participate in research activities”.
☛ The institutional review board:
☞ reviews the plan of study
☞ can require changes
☞ reviews the consent form
☞ monitors progress at least once a year
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 91 / 227
Ethics — Informed Consent
☛ All subjects must give their informed consent before data are collected.
☛ Subjects must be informed in advance about the nature of a study and any risk of harm it might bring.
☛ Subjects must then consent in writing.
☛ Who can’t give informed consent?
☞ prison inmates
☞ very young children
☞ people with mental disorders
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 92 / 227
Ethics — Confidentiality
☛ All individual data must be kept confidential. Only statistical summaries may be made public.
☛ Confidentiality is not the same as anonymity. Anonymity prevents follow-ups to improve non-response or inform subjects of results.
☛ Separate the identity of the subjects from the rest of the data immediately!
☛ Example: Citizens are required to give information to the government (tax returns, social security contributions). Some people feel that individuals should be able to forbid any other use of their data, even with all identification removed.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 93 / 227
Ethics — Clinical Trials
☛ Clinical trials study the effectiveness of medical treatments on actual patients. These treatments can harm as well as heal.
☛ Points for a discussion:
☞ Randomized comparative experiments are the only way to see the true effects of new treatments.
☞ Most benefits of clinical trials go to future patients. We must balance future benefits against present risks.
☞ The interests of the subject must always prevail over the interests of science and society.
☛ In the 1930s, the Public Health Service Tuskegee study recruited 399 poor blacks with syphilis and 201 without the disease in order to observe how syphilis progressed without treatment. The Public Health Service prevented any treatment until word leaked out and forced an end to the study in the 1970s.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 94 / 227
Ethics — Behavioral and Social Science Experiments
☛ Many behavioral experiments rely on hiding the true purpose of the study.
☛ Subjects would change their behavior if told in advance what investigators were looking for.
☛ The “Ethical Principles” of the American Psychological Association require consent unless a study merely observes behavior in a public space.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 95 / 227
Overview
1 Looking at Data — Distributions
2 Looking at Data — Relationships
3 Producing Data
4 Probability: the Study of Randomness
5 Sampling Distributions
6 Introduction to Inference
7 Inference for Distributions
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 96 / 227
Objectives
☛ Randomness
☛ Probability models: Probability and randomness; Sample spaces; Probability rules; Assigning probabilities: finite number of outcomes; Assigning probabilities: equally likely outcomes; Independence and multiplication rule
☛ Random variables
☛ Means and variances of random variables: Discrete random variables; Continuous random variables; Normal probability distributions; Mean of a random variable; Law of large numbers; Variance of a random variable; Rules for means and variances
☛ General probability rules: General addition rules; Conditional probability; General multiplication rules; Tree diagrams; Bayes’s rule
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 97 / 227
Basic Concepts
☛ A phenomenon is random if individual outcomes are uncertain, but there is nonetheless a regular distribution of outcomes in a large number of repetitions.
☛ The probability of any outcome of a random phenomenon can be defined as the proportion of times the outcome would occur in a very long series of repetitions.
☛ Two events are independent if the probability that one event occurs on any given trial of an experiment is not affected or changed by the occurrence of the other event.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 98 / 227
Example — Coin Toss
The result of any single coin toss is random. But the result over many tosses is predictable, as long as the trials are independent (i.e., the outcome of a new coin flip is not influenced by the result of the previous flip).
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 99 / 227
Understanding Probability — Sometimes Not Obvious
☛ Example: The Monty Hall problem (three-door game) is a counter-intuitive statistics puzzle. There are 3 doors, behind which are two goats and a car. You pick a door (say door 1), hoping for the car. Monty Hall, who knows where the car is, opens one of the other two doors (2 and 3) to reveal a goat. Here’s the game: do you stick with door 1 (your original guess) or switch to the other unopened door? Does it matter?
☛ Calculation: Door 1 gives you a 1/3 chance of winning if you stick with it and a 2/3 chance of losing. Whenever sticking loses, switching wins, so switching wins with probability 2/3.
☛ Still cannot imagine why you should switch? How about 1000 doors?
☛ So is it very clear now? Explain why the chance of winning by switching is NOT 1/3.
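The calculation can be checked with a quick simulation. This is a sketch under the usual assumptions of the puzzle (the host always opens a goat door other than your pick, choosing at random when he has two options); the function name `play` is hypothetical.

```python
import random

def play(switch, rng):
    """Play one round; return True if the player wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the original pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
trials = 20_000
stick = sum(play(False, rng) for _ in range(trials)) / trials
swap = sum(play(True, rng) for _ in range(trials)) / trials
print(stick, swap)
```

The two printed proportions should land near 1/3 and 2/3 respectively.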
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 100 / 227
Probability Models
☛ Probability models describe, mathematically, the outcome of random processes. They consist of two parts:
☞ S = Sample Space: This is a set, or list, of all possible outcomes of a random process. An event is a subset of the sample space.
☞ A probability for each possible event in the sample space S.
☛ Example 1: Probability model for a coin toss:
☞ S = {Head, Tail}
☞ Probability of heads = 0.5
☞ Probability of tails = 0.5
☛ Example 2: Probability model for a two-coin toss event:
☞ Ordered?
☞ Non-ordered?
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 101 / 227
Probability Rules — Addition Rule for Disjoint Events
☛ Probabilities range from 0 (no chance of the event) to 1 (the event has to happen). For any event A, 0 ≤ P(A) ≤ 1.
☛ Because some outcome must occur on every trial, the sum of the probabilities for all possible outcomes (the sample space) must be exactly 1. P(sample space) = 1
☛ Two events A and B are disjoint if they have no outcomes in common and can never happen together. The probability that A or B occurs is then the sum of their individual probabilities. P(A or B) = P(A ∪ B) = P(A) + P(B). This is the addition rule for disjoint events.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 102 / 227
Probability Rules — Complement Rule
☛ The complement of any event A is the event that A does not occur, written as $A^c$.
☛ The complement rule states that the probability of an event not occurring is 1 minus the probability that it does occur: $P(\text{not } A) = P(A^c) = 1 - P(A)$.
☛ Example 1: $\text{Tail}^c$ = not Tail = Head; $P(\text{Tail}^c) = P(\text{Head}) = 0.5$.
☛ Example 2: if $P(\text{score} \ge 80) = 0.6$, then $P(\text{score} < 80) = 0.4$.
☛ Venn diagram: the sample space is made up of an event A and its complement $A^c$, i.e., everything that is not A.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 103 / 227
Probabilities: Finite Number of Outcomes
☛ Finite sample spaces deal with discrete data, i.e., data that can only take on a limited number of values.
☛ The individual outcomes of a random phenomenon are always disjoint.
☛ Addition rule: the probability of any event is the sum of the probabilities of the outcomes making up the event.
☛ Example: M&M candies
If you draw an M&M candy at random from a bag, the candy will have one of six colors. Assume the probability of each color is given (table not preserved).
☞ The probability that an M&M chosen at random is blue: P(blue) = 1 − [P(brown) + P(red) + P(yellow) + P(green) + P(orange)] = 0.1
☞ The probability that a random M&M is either red, yellow, or orange: P(red or yellow or orange) = P(red) + P(yellow) + P(orange) = 0.5
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 104 / 227
Probabilities: Equally Likely Outcomes
☛ We can assign probabilities either:
☞ empirically: from our knowledge of numerous similar past events
☞ or theoretically: from our understanding of the phenomenon and symmetries in the problem
☛ If a random phenomenon has k equally likely possible outcomes, then each individual outcome has probability 1/k. And, for any event A:
$P(A) = \dfrac{\text{count of outcomes in } A}{\text{count of outcomes in } S}$
☛ Example: Toss two dice,
P(the roll of two dice sums to 5) = P(1,4) + P(2,3) + P(3,2) + P(4,1) = 4/36 ≈ 0.111
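The counting rule for equally likely outcomes can be checked by listing the sample space. A minimal sketch, treating the two dice as ordered (36 outcomes):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely ordered rolls
favorable = [roll for roll in outcomes if sum(roll) == 5]
p_sum_5 = Fraction(len(favorable), len(outcomes))  # count in A / count in S
print(favorable, p_sum_5)
```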
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 105 / 227
Discussion: Do You Want to Play the Roulette?
☛ The payout, for American and European roulette, can be calculated by $\text{payout} = \frac{36}{n} - 1$, where n is the number of squares the player is betting on. The initial bet is returned in addition to the mentioned payout.
☛ The house average or house edge (expected value) is the amount the player loses, relative to the bet made, on average.
☛ The hold is the average percentage of the money originally brought to the table that the player loses before he leaves. The average win/hold for double-zero wheels is between 21% and 30%, significantly more than the 5.26% house edge.
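The 5.26% house edge follows from the payout formula. The sketch below assumes a double-zero wheel with 38 equally likely pockets (37 for a single-zero wheel) and a bet of one unit; the function name `expected_net_win` is hypothetical.

```python
from fractions import Fraction

def expected_net_win(n, pockets=38):
    """Expected net win per unit bet when betting on n squares."""
    payout = Fraction(36, n) - 1          # payout formula from the slide
    p_win = Fraction(n, pockets)
    # Win the payout with probability p_win, otherwise lose the unit bet.
    return p_win * payout - (1 - p_win)

for n in (1, 2, 18):
    print(n, float(expected_net_win(n)))
print(float(expected_net_win(1, pockets=37)))   # single-zero wheel
```

Under this formula the expected loss is 2/38 of the bet, about 5.26%, whatever n is.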
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 106 / 227
Probability Rules — Multiplication Rule for IndependentEvents
☛ Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs.
☛ Multiplication rule for independent events: if A and B are independent, P(A and B) = P(A)P(B).
☛ Example: Two consecutive coin tosses: P(first Tail and second Tail) = P(first Tail) × P(second Tail) = 0.5 × 0.5 = 0.25.
☛ Venn diagram: Event A and event B. The intersection represents the event {A and B} and outcomes common to both A and B.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 107 / 227
Example — Application of Addition/Multiplication Rules
☛ A couple wants three children. What are the arrangements of boys (B) and girls (G)? Assume that the probability that a baby is a boy or a girl is the same, 0.5.
☛ Sample space: BBB, BBG, BGB, GBB, GGB, GBG, BGG, GGG
☞ All eight outcomes in the sample space are equally likely. The probability of each is thus 1/8.
☞ Each birth is independent of the next, so we can use the multiplication rule. Example: P(BBB) = P(B) × P(B) × P(B) = 1/2 × 1/2 × 1/2 = 1/8
☛ A couple wants three children. What are the numbers of girls (X) they could have? The same genetic laws apply. We can use the probabilities above and the addition rule for disjoint events to calculate the probabilities for X.
☛ Sample space: 0, 1, 2, 3
☞ P(X = 0) = P(BBB) = 1/8
☞ P(X = 1) = P(BBG or BGB or GBB) = P(BBG) + P(BGB) + P(GBB) = 3/8
☞ and so on,
Value of X    0    1    2    3
Probability   1/8  3/8  3/8  1/8
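The distribution of X can be reproduced by listing the eight equally likely birth orders and counting girls in each. A short sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

families = ["".join(f) for f in product("BG", repeat=3)]   # 8 equally likely outcomes
counts = Counter(f.count("G") for f in families)           # number of girls per outcome
distribution = {x: Fraction(counts[x], len(families)) for x in sorted(counts)}
print(distribution)
```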
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 108 / 227
Discrete Random Variables
☛ A random variable is a variable whose value is a numerical outcome of a random phenomenon.
☛ A discrete random variable X has a finite number of possible values. The probability distribution of a random variable X lists the values and their probabilities. The probabilities $p_i$ must add up to 1.
Value of X    $x_1$  $x_2$  $x_3$  · · ·  $x_k$
Probability   $p_1$  $p_2$  $p_3$  · · ·  $p_k$
☛ The probability of any event is the sum of the probabilities $p_i$ of the values of X that make up the event.
☛ A coin was tossed twice. What is the probability to have at least 1 head?
☛ A coin was tossed five times. What is the probability to have at least 3 heads?
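Both coin questions can be answered by adding the probabilities of the values that make up the event. The sketch assumes a fair coin and independent tosses; the function name `prob_at_least` is hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k, n):
    """P(at least k heads in n tosses of a fair coin)."""
    # comb(n, j) of the 2**n equally likely sequences contain exactly j heads.
    return Fraction(sum(comb(n, j) for j in range(k, n + 1)), 2 ** n)

print(prob_at_least(1, 2))   # two tosses, at least 1 head
print(prob_at_least(3, 5))   # five tosses, at least 3 heads
```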
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 109 / 227
Continuous Random Variables
☛ A continuous random variable X takes all values in an interval.
☛ How do we assign probabilities to events in an infinite sample space?
☞ Use density curves and compute probabilities for intervals.
☞ The probability of any event is the area under the density curve for the values of X that make up the event.
☛ For a continuous random variable, the probability of any single value is zero. Only intervals can have a non-zero probability, represented by the area under the density curve for that interval.
☛ The shaded area under a density curve shows the proportion of individuals in a population with values of X between $x_1$ and $x_2$. Because the probability of drawing one individual at random depends on the frequency of this type of individual in the population, the probability is also the shaded area under the curve.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 110 / 227
Normal Probability Distributions
☛ The probability distribution of many random variables is a normal distribution. It shows what values the random variable can take and is used to assign probabilities to those values.
☛ To calculate probabilities with the normal distribution, we will standardize the random variable (z score) and use Table A.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 111 / 227
Example: Normal Probability Distributions
☛ What is the probability, if we pick one woman at random, that her height X will fall in a given range? For instance, between 68 and 70 inches: P(68 < X < 70)?
☛ Because the woman is selected at random, X is a random variable. The z-scores below use a Normal model with mean 64.5 inches and standard deviation 2.5 inches.
☛ z-scores:
$z(68) = \dfrac{68 - 64.5}{2.5} = 1.4; \quad z(70) = \dfrac{70 - 64.5}{2.5} = 2.2.$
☛ The area under the curve for the interval is 0.9861 − 0.9192 = 0.0669. Thus, the probability that a randomly chosen woman falls into this range is 6.69%, or P(68 < X < 70) = 6.69%.
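The table lookup can be cross-checked in software. This sketch uses Python's standard `statistics.NormalDist` and takes the mean 64.5 and standard deviation 2.5 from the z-score calculation above.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)   # values taken from the z-score calculation
z68 = (68 - heights.mean) / heights.stdev
z70 = (70 - heights.mean) / heights.stdev
p = heights.cdf(70) - heights.cdf(68)      # area under the curve between 68 and 70
print(round(z68, 2), round(z70, 2), round(p, 4))
```

The result agrees with the Table A answer of about 6.69% up to rounding.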
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 112 / 227
Mean of a Random Variable
☛ The mean $\bar{x}$ of a set of observations is their arithmetic average.
☛ The mean $\mu$ of a random variable X (also called the expected value of X) is a weighted average of the possible values of X, reflecting the fact that all outcomes might not be equally likely.
☛ For a discrete random variable X with probability distribution,
$\mu_X = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k = \sum_i x_i p_i.$
☛ Example: the probability distribution is
Value of X    0      1      2      3
Probability   0.027  0.189  0.441  0.343
Then the mean $\mu$ of X is
$\mu_X = 0 \times 0.027 + 1 \times 0.189 + 2 \times 0.441 + 3 \times 0.343 = 2.1.$
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 113 / 227
Law of Large Numbers
☛ The law of large numbers: as the number of randomly drawn observations (n) in a sample increases, the mean of the sample ($\bar{x}$) gets closer and closer to the population mean $\mu$.
☛ It is valid for any population (with a finite mean $\mu$).
☛ We often intuitively expect predictability over a few random observations, but this expectation is wrong. Example: if the first toss is a head, will the second one be a tail?
☛ The law of large numbers only applies to really large numbers.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 114 / 227
Variance of a Random Variable
☛ The variance and the standard deviation are the measures of spread that accompany the choice of the mean to measure center.
☛ The variance $\sigma_X^2$ of a random variable is a weighted average of the squared deviations $(X - \mu_X)^2$ of the variable X from its mean $\mu_X$. Each outcome is weighted by its probability in order to take into account outcomes that are not equally likely.
☛ The larger the variance of X, the more scattered the values of X on average. The positive square root of the variance gives the standard deviation $\sigma_X$ of X.
☛ For a discrete random variable X, the variance $\sigma_X^2$ is
$\sigma_X^2 = \sum_i (x_i - \mu_X)^2 p_i.$
☛ Example: the probability distribution is
Value of X    0      1      2      3
Probability   0.027  0.189  0.441  0.343
Then the mean $\mu$ of X is
$\mu_X = 0 \times 0.027 + 1 \times 0.189 + 2 \times 0.441 + 3 \times 0.343 = 2.1.$
The variance of X is
$\sigma_X^2 = 0.027 \times (0 - 2.1)^2 + 0.189 \times (1 - 2.1)^2 + 0.441 \times (2 - 2.1)^2 + 0.343 \times (3 - 2.1)^2 = 0.63.$
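The mean and variance formulas translate directly into weighted sums. The sketch below uses 0.343 for P(X = 3), the value that appears in the variance calculation and that makes the probabilities add up to 1.

```python
values = [0, 1, 2, 3]
probs = [0.027, 0.189, 0.441, 0.343]   # probabilities must sum to 1

# Mean: weighted average of the values.
mean = sum(x * p for x, p in zip(values, probs))
# Variance: weighted average of squared deviations from the mean.
variance = sum(p * (x - mean) ** 2 for x, p in zip(values, probs))
print(round(mean, 3), round(variance, 3))
```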
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 115 / 227
Rules for Means and Variances
☛ If X is a random variable and a and b are fixed numbers, then
$\mu_{a+bX} = a + b\mu_X;$
$\sigma_{a+bX}^2 = b^2\sigma_X^2.$
☛ If X and Y are two independent random variables, then
$\mu_{X \pm Y} = \mu_X \pm \mu_Y;$
$\sigma_{X \pm Y}^2 = \sigma_X^2 + \sigma_Y^2.$
☛ If X and Y are NOT independent but have correlation r, then
$\mu_{X \pm Y} = \mu_X \pm \mu_Y;$
$\sigma_{X \pm Y}^2 = \sigma_X^2 + \sigma_Y^2 \pm 2r\sigma_X\sigma_Y.$
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 116 / 227
Example: Investment
☛ You invest 20% of your funds in Treasury bills and 80% in an index fund that represents all U.S. common stocks. Your rate of return over time is proportional to that of the T-bills (X) and of the index fund (Y), such that
$R = 0.2X + 0.8Y.$
Based on annual returns between 1950 and 2003:
☞ Annual return on T-bills: $\mu_X = 5.0\%$ and $\sigma_X = 2.9\%$
☞ Annual return on stocks: $\mu_Y = 13.2\%$ and $\sigma_Y = 17.6\%$
☞ Correlation between X and Y is r = −0.11.
☛ Solution:
$\mu_R = 0.2\mu_X + 0.8\mu_Y = 0.2 \times 5 + 0.8 \times 13.2 = 11.56\%.$
$\sigma_R^2 = \sigma_{0.2X}^2 + \sigma_{0.8Y}^2 + 2r\sigma_{0.2X}\sigma_{0.8Y} = 0.2^2\sigma_X^2 + 0.8^2\sigma_Y^2 + 2r \times 0.2\sigma_X \times 0.8\sigma_Y = 196.786$
$\sigma_R = \sqrt{196.786} = 14.03\%.$
The portfolio has a smaller mean return than an all-stock portfolio, but it is also less risky.
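The arithmetic of the solution can be reproduced step by step. The variable names below are illustrative; the numbers are those given in the example (returns in percent).

```python
from math import sqrt

w_x, w_y = 0.2, 0.8          # portfolio weights
mu_x, sd_x = 5.0, 2.9        # T-bills
mu_y, sd_y = 13.2, 17.6      # stocks
r = -0.11                    # correlation between X and Y

mu_r = w_x * mu_x + w_y * mu_y
# Variance of a sum of correlated variables, applied to 0.2X and 0.8Y.
var_r = (w_x * sd_x) ** 2 + (w_y * sd_y) ** 2 + 2 * r * (w_x * sd_x) * (w_y * sd_y)
print(round(mu_r, 2), round(var_r, 3), round(sqrt(var_r), 2))
```

The negative correlation term slightly reduces the variance relative to the uncorrelated case.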
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 117 / 227
General Addition Rule
☛ Recall: addition rule for disjoint events:
P(A or B) = P(A ∪ B) = P(A) + P(B).
☛ General addition rule for any two events A and B: the probability that A occurs, B occurs, or both events occur is:
P(A or B) = P(A) + P(B)− P(A and B).
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 118 / 227
Example: General Addition Rule
☛ Question: what is the probability of randomly drawing either an ace or a heart from a deck of 52 playing cards?
☛ Facts: there are 4 aces in the pack and 13 hearts. However, 1 card is both an ace and a heart.
☛ Solution:
$P(\text{ace}) = \dfrac{4}{52}; \quad P(\text{heart}) = \dfrac{13}{52}; \quad P(\text{ace and heart}) = \dfrac{1}{52};$
$P(\text{ace or heart}) = P(\text{ace}) + P(\text{heart}) - P(\text{ace and heart}) = \dfrac{4}{52} + \dfrac{13}{52} - \dfrac{1}{52} = \dfrac{16}{52} = \dfrac{4}{13}.$
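The general addition rule can be verified against a direct count over the deck. A minimal sketch; the helper `prob` and the card encoding are illustrative choices.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 equally likely cards

def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_ace(card):
    return card[0] == "A"

def is_heart(card):
    return card[1] == "hearts"

by_rule = prob(is_ace) + prob(is_heart) - prob(lambda c: is_ace(c) and is_heart(c))
by_count = prob(lambda c: is_ace(c) or is_heart(c))
print(by_rule, by_count)
```

Subtracting P(ace and heart) prevents the ace of hearts from being counted twice.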
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 119 / 227
Conditional Probability
☛ Example 1: the probability that a cloudy day will result in rain is different if you live in Los Angeles than if you live in Seattle.
☛ Example 2: what is the probability that you win the Jackpot? How about if I tell you that the winner is inside the classroom?
☛ Conditional probabilities reflect how the probability of an event can change if we know that some other event has occurred or is occurring.
☛ Our brains effortlessly calculate conditional probabilities, updating our “degree of belief” with each new piece of evidence.
☛ The conditional probability of event B given event A is (provided that P(A) ≠ 0):
$P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)}.$
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 120 / 227
Example: Conditional Probability
☛ The conditional probability of event B given event A is (provided that P(A) ≠ 0):
$P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)}.$
☛ Example: assume a number set includes the whole numbers from 1 to 100.
☞ Define event A as "the number is even", and define event B as "the number is in the fourth quartile" (76 to 100).
☞ Facts, by direct counting: P(A) = 50/100 = 0.5; P(B) = 25/100 = 0.25; P(B and A) = 13/100; P(B|A) = 13/50; P(A|B) = 13/25.
☞ Calculations:
$P(B \mid A) = \dfrac{P(A \text{ and } B)}{P(A)} = \dfrac{13/100}{50/100} = \dfrac{13}{50};$
$P(A \mid B) = \dfrac{P(B \text{ and } A)}{P(B)} = \dfrac{13/100}{25/100} = \dfrac{13}{25}.$
The formula is verified.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 121 / 227
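The 1-to-100 example can be checked by counting with sets. This sketch takes the fourth quartile to mean the numbers 76 through 100; the helper `P` is illustrative.

```python
from fractions import Fraction

numbers = range(1, 101)
A = {k for k in numbers if k % 2 == 0}   # even numbers
B = {k for k in numbers if k >= 76}      # fourth quartile: 76..100

def P(event):
    """Probability of an event when each number is equally likely."""
    return Fraction(len(event), len(numbers))

p_b_given_a = P(A & B) / P(A)
p_a_given_b = P(A & B) / P(B)
print(p_b_given_a, p_a_given_b)
```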
Example — Understanding Conditional Probability
☛ Bertrand’s box paradox: There are three boxes, each with one drawer on each of two sides. Each drawer contains a coin. One box has a gold coin on each side (GG), one a silver coin on each side (SS), and the other a gold coin on one side and a silver coin on the other (GS). A box is chosen at random, a random drawer is opened, and a gold coin is found inside it. What is the chance of the coin on the other side being gold?
☛ Reasoning 1: Originally, all three boxes were equally likely to be chosen. The chosen box cannot be box SS. So it must be box GG or GS. The two remaining possibilities are equally likely. So the probability that the box is GG, and the other coin is also gold, is 1/2.
☛ Reasoning 2: Originally, all six coins were equally likely to be chosen. The chosen coin cannot be from drawer S of box GS, or from either drawer of box SS. So it must come from the G drawer of box GS, or either drawer of box GG. The three remaining possibilities are equally likely, so the probability that the drawer is from box GG is 2/3.
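One way to compare the two lines of reasoning is to list the six equally likely (box, drawer) outcomes and condition on seeing gold. This is a sketch of that enumeration; the tuple layout is an illustrative choice.

```python
from fractions import Fraction

boxes = {"GG": ("G", "G"), "SS": ("S", "S"), "GS": ("G", "S")}

# Each (box, drawer) pair is equally likely: 3 boxes x 2 drawers = 6 outcomes.
# Each outcome records (box name, coin in opened drawer, coin in other drawer).
outcomes = [(name, coins[i], coins[1 - i]) for name, coins in boxes.items() for i in (0, 1)]

gold_seen = [o for o in outcomes if o[1] == "G"]     # condition: opened drawer shows gold
other_gold = [o for o in gold_seen if o[2] == "G"]   # other drawer is also gold
answer = Fraction(len(other_gold), len(gold_seen))
print(answer)
```

Under these assumptions the enumeration gives 2/3, which agrees with Reasoning 2: box GG accounts for two of the three ways to see a gold coin.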
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 122 / 227
General Multiplication Rules
☛ The general multiplication rule: the probability that any two events, A and B, both occur is:
$P(A \text{ and } B) = P(A)P(B \mid A).$
☛ Example: what is the probability of randomly drawing two hearts from a deck of 52 playing cards?
There are 13 hearts in the pack. Let A and B be the events that the first and second cards drawn are hearts, respectively. Assume that the first card is not replaced before the second card is drawn.
$P(A) = \dfrac{13}{52} = \dfrac{1}{4}; \quad P(B \mid A) = \dfrac{12}{51}.$
$P(\text{two hearts}) = P(A) \times P(B \mid A) = \dfrac{1}{4} \times \dfrac{12}{51} = \dfrac{1}{17}.$
Notice that the probability of a heart on the second draw depends on which card was removed on the first draw.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 123 / 227
Independent Events
☛ Two events A and B that both have positive probability are independent if P(B|A) = P(B).
☛ If A and B are independent, then P(A and B) = P(A)P(B). (A and B are independent when they have no influence on each other’s occurrence.)
☛ Example: what is the probability of randomly drawing two hearts from a deck of 52 playing cards if the first card (event A) is replaced (and the cards re-shuffled) before the second card (event B) is drawn?
$P(A) = \dfrac{1}{4}; \quad P(B) = \dfrac{1}{4}; \quad P(B \mid A) = \dfrac{1}{4}.$
Because P(B) = P(B|A), the two draws are independent events.
$P(A \text{ and } B) = P(A) \times P(B) = \dfrac{1}{4} \times \dfrac{1}{4} = \dfrac{1}{16}.$
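The two-hearts calculations, with and without replacement, can be placed side by side using exact fractions. A short sketch; the variable names are illustrative.

```python
from fractions import Fraction

hearts, cards = 13, 52

# Without replacement: P(A) * P(B | A), the general multiplication rule.
without = Fraction(hearts, cards) * Fraction(hearts - 1, cards - 1)
# With replacement: the draws are independent, so P(A) * P(B).
with_repl = Fraction(hearts, cards) * Fraction(hearts, cards)
print(without, with_repl)
```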
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 124 / 227
Probability Trees
☛ Conditional probabilities can get complex, and it is often a good strategy to build a probability tree that represents all possible outcomes graphically and assigns conditional probabilities to subsets of events.
☛ Example: tree diagram for chat room habits for three adult age groups:
P(chatting) = 0.136 + 0.099 + 0.017 = 0.252
About 25% of all adult Internet users visit chat rooms.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 125 / 227
Example: Probability Trees
☛ If a woman in her 20s gets screened for breast cancer and receives a positive test result, what is the probability that she does have breast cancer?
☛ Solution: Possible outcomes given the positive diagnosis: positive test and breast cancer or positive test but no cancer (false positive).
$P(c \mid p) = \dfrac{P(c \text{ and } p)}{P(c \text{ and } p) + P(nc \text{ and } p)} = \dfrac{0.0004 \times 0.8}{0.0004 \times 0.8 + 0.9996 \times 0.1} \approx 0.3\%.$
This value is called the positive predictive value, or PV+.
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 126 / 227
Bayes’s Rule
☛ An important application of conditional probabilities is Bayes’s rule. It is the foundation of many modern statistical applications beyond the scope of this textbook.
☛ If a sample space is decomposed into k disjoint events $A_1, A_2, \dots, A_k$, none with zero probability and with $P(A_1) + P(A_2) + \cdots + P(A_k) = 1$, and if C is any other event such that P(C) is not 0 or 1, then
$P(A_i \mid C) = \dfrac{P(C \mid A_i)P(A_i)}{P(C \mid A_1)P(A_1) + P(C \mid A_2)P(A_2) + \cdots + P(C \mid A_k)P(A_k)}.$
Zhaoxian Zhou (USM) CSS 211 January 11, 2018 127 / 227
Example: Bayes’s Rule
☛ If a woman in her 20s gets screened for breast cancer and receives a positive test result, what is the probability that she does have breast cancer?
☛ Solution: $A_1$ is cancer, $A_2$ is no cancer, C is a positive test result. Use Bayes’s rule:
$$P(c \mid p) = \frac{P(p \mid c) P(c)}{P(p \mid c) P(c) + P(p \mid nc) P(nc)} = \frac{0.8 \times 0.0004}{0.8 \times 0.0004 + 0.1 \times 0.9996} \approx 0.3\%.$$
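The same calculation can be sketched in a few lines of Python. The helper name `posterior` and the list layout are illustrative choices, not part of the slides; the numbers are the ones used in the screening example (P(c) = 0.0004, P(p | c) = 0.8, P(p | nc) = 0.1).

```python
def posterior(priors, likelihoods, i):
    """Bayes's rule: P(A_i | C) from priors P(A_k) and likelihoods P(C | A_k).

    `posterior` is a hypothetical helper written for this example.
    """
    # Denominator: P(C) by the law of total probability
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / total


priors = [0.0004, 0.9996]   # A1 = cancer, A2 = no cancer
likelihoods = [0.8, 0.1]    # P(positive | A1), P(positive | A2)

ppv = posterior(priors, likelihoods, 0)
print(f"P(cancer | positive) = {ppv:.4f}")  # roughly 0.003, i.e. about 0.3%
```

Because the disease is so rare, the false positives from the large healthy group swamp the true positives, which is why the positive predictive value is so small.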
Objectives
☛ Sampling distribution of a sample mean: The mean and standard deviation of $\bar{x}$; For normally distributed populations; The central limit theorem; Weibull distributions
☛ Sampling distributions for counts and proportions: Binomial distributions for sample counts; Binomial distributions in statistical sampling; Binomial mean and standard deviation; Sample proportions; Normal approximation; Binomial formulas
Reminder
☛ The two types of data
☞ Quantitative: Something that can be counted or measured and then averaged across individuals in the population (e.g., your height, your age, your IQ score)
☞ Categorical: Something that falls into one of several categories. What can be counted is the proportion of individuals in each category (e.g., your gender, your hair color, your blood type: A, B, AB, O).
☛ How do you figure it out? Ask:
☞ What are the n individuals/units in the sample (of size “n”)?
☞ What is being recorded about those n individuals/units?
☞ Is that a number (quantitative) or a statement (categorical)?
Sampling Distribution of the Sample Mean
☛ We take many random samples of a given size n from a population with mean µ and standard deviation σ.
☛ Some sample means will be above the population mean µ and some will be below, making up the sampling distribution.
Sampling Distribution
☛ The sampling distribution of a statistic is the distribution of all possible values taken by the statistic (e.g., the mean) when all possible samples (e.g., how many?) of a fixed size n (e.g., 20) are taken from the population (e.g., USM, 15000).
☛ It is a theoretical idea - we do not actually build it. Why? The number of all different samples is
$$\binom{15000}{20} = \frac{15000!}{14980! \times 20!} \approx 1.35 \times 10^{65}.$$
☛ The sampling distribution of a statistic is the probability distribution of that statistic.
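To get a feel for why the full sampling distribution is never built in practice, the count of distinct samples can be computed directly. This is a minimal sketch using the standard-library `math.comb`; the variable name `n_samples` is an illustrative choice.

```python
import math

# Number of distinct samples of size 20 from a population of 15000
n_samples = math.comb(15000, 20)

# Python integers are exact, so this is the full count; print it in
# scientific notation to see the order of magnitude (about 10**65).
print(f"{n_samples:.2e}")
```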
Center of the Sampling Distribution of the Sample Mean
For any population with mean µ and standard deviation σ,
☛ Mean of a sampling distribution of $\bar{x}$
☞ The mean, or center, of the sampling distribution of $\bar{x}$ is equal to the population mean µ: $\mu_{\bar{x}} = \mu$.
☞ There is no tendency for a sample mean to fall systematically above or below µ, even if the distribution of the raw data is skewed. Thus, the sample mean $\bar{x}$ is an unbiased estimate of the population mean µ: it will be “correct on average” in many samples.
☞ Example: Assume student scores in all USM classes follow a distribution (Normal or not) with a mean of 80 and a standard deviation of 10. If each student's scores are treated as a sample of size 4, then the mean of the mean scores of all students is approximately ???
Spread of the Sampling Distribution of the Sample Mean
For any population with mean µ and standard deviation σ,
☛ Standard deviation of a sampling distribution of $\bar{x}$
☞ The standard deviation of the sampling distribution is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, where n is the sample size.
☞ The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. Meaning: averages of samples are less variable than individual observations.
☞ Example: Assume student scores in all USM classes follow a distribution (Normal or not) with a mean of 80 and a standard deviation of 10. If each student's scores are treated as a sample of size 4, then the standard deviation of the mean scores of all students is approximately ???.
For Normally Distributed Populations
When a variable in a population is normally distributed, the sampling distribution of $\bar{x}$ for all possible samples of size n is also normally distributed.
Example
Hypokalemia is diagnosed when blood potassium levels are below 3.5 mEq/dl. Assume that we know a patient whose measured potassium levels vary daily according to a normal distribution N(3.8, 0.2).
☛ If only one measurement is made, what is the probability that this patient will be misdiagnosed with hypokalemia?
$$z = \frac{x - \mu}{\sigma} = \frac{3.5 - 3.8}{0.2} = -1.5; \quad P(z < -1.5) \approx 7\%.$$
☛ Instead, if measurements are taken on 4 separate days, what is the probability of a misdiagnosis if the average is used?
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{3.5 - 3.8}{0.2/\sqrt{4}} = -3.0; \quad P(z < -3.0) \approx 0.1\%.$$
☛ Note: Make sure to standardize (z) using the standard deviation for the sampling distribution.
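The two probabilities can be checked without a Normal table. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative, and the numbers are those of the potassium example.

```python
import math
from statistics import NormalDist

mu, sigma, cutoff = 3.8, 0.2, 3.5   # patient's N(3.8, 0.2), diagnosis threshold

# One measurement: use the population standard deviation sigma
p_single = NormalDist(mu, sigma).cdf(cutoff)

# Average of n = 4 measurements: use sigma / sqrt(n) instead
n = 4
p_average = NormalDist(mu, sigma / math.sqrt(n)).cdf(cutoff)

print(f"one measurement:  {p_single:.4f}")   # close to 7%
print(f"average of four:  {p_average:.4f}")  # close to 0.1%
```

Averaging four measurements halves the standard deviation, which moves the cutoff from 1.5 to 3 standard deviations below the mean.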
The Central Limit Theorem
Central Limit Theorem: When randomly sampling from any population with mean µ and standard deviation σ, when n is large enough, the sampling distribution of $\bar{x}$ is approximately normal: $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$.
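A small simulation can make the theorem concrete. This is only a sketch under assumed settings: the population is taken to be exponential with mean 1 and standard deviation 1 (strongly right-skewed), the sample size is n = 30, and the seed and number of repetitions are arbitrary choices made for reproducibility.

```python
import math
import random
import statistics

rng = random.Random(0)      # fixed seed, chosen arbitrarily
n = 30                      # size of each sample
num_samples = 5000          # number of repeated samples

# Draw many samples from a skewed population (exponential, mean 1, sd 1)
# and record the mean of each one.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

center = statistics.fmean(sample_means)   # should be near mu = 1
spread = statistics.stdev(sample_means)   # should be near sigma / sqrt(n)

print(f"mean of sample means: {center:.3f} (theory: 1.000)")
print(f"sd of sample means:   {spread:.3f} (theory: {1 / math.sqrt(n):.3f})")
```

A histogram of `sample_means` would look close to a Normal curve even though the individual observations are far from Normal.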
Linear Combination of Independent Normal Random Variables
☛ Any linear combination of independent normal random variables is also normally distributed.
☛ Example: assume X satisfies N(20, 4), Y satisfies N(18, 8), and X and Y are independent; then the difference X − Y is also Normally distributed.
☞ Its mean: $\mu_{X-Y} = \mu_X - \mu_Y = 20 - 18 = 2$.
☞ Its variance: $\sigma^2_{X-Y} = \sigma^2_X + \sigma^2_Y = 4^2 + 8^2 = 80$.
☞ Therefore the difference X − Y satisfies N(2, 8.94).
☞ The probability that X < Y is P(X < Y) = P(X − Y < 0).
$$z = \frac{0 - 2}{8.94} = -0.22,$$
and P(X < Y) = P(z < −0.22) = 0.4129.
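The steps of this example can be reproduced numerically. The sketch below assumes X and Y are independent and uses `statistics.NormalDist`; because it does not round z to two decimals, its answer differs slightly in the third decimal place from a table lookup at z = −0.22.

```python
import math
from statistics import NormalDist

mu_x, sd_x = 20, 4    # X ~ N(20, 4)
mu_y, sd_y = 18, 8    # Y ~ N(18, 8)

# Means subtract, variances add (independence assumed)
mu_diff = mu_x - mu_y
sd_diff = math.sqrt(sd_x**2 + sd_y**2)

# P(X < Y) = P(X - Y < 0)
p_x_less_y = NormalDist(mu_diff, sd_diff).cdf(0)

print(f"X - Y ~ N({mu_diff}, {sd_diff:.2f})")
print(f"P(X < Y) = {p_x_less_y:.4f}")
```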
Further Properties
☛ More generally, the central limit theorem is valid as long as we are sampling many small random events, even if the events have different distributions (as long as no one random event dominates the others).
☛ It explains why the normal distribution is so common.
☛ Example: Height seems to be determined by a large number of genetic and environmental factors, like nutrition. The “individuals” are genes and environmental factors. Your height is a mean.
Weibull Distributions
☛ Weibull distributions are used to model time to failure/product lifetime and are common in engineering to study product reliability.
☛ Product lifetimes can be measured in units of time, distance, or number of cycles, for example. Some applications include:
☞ Quality control (breaking strength of products and parts, food shelf life)
☞ Maintenance planning (scheduled car revision, airplane maintenance)
☞ Cost analysis and control (number of returns under warranty, delivery time)
☞ Research (materials properties, microbial resistance to treatment)
Examples of Weibull Distributions
Density curves of three members of the Weibull family, each describing a different type of product time to failure in manufacturing.
☛ Infant mortality: Many products fail immediately and the remainder last a long time. Manufacturers only ship the products after inspection.
☛ Early failure: Products usually fail shortly after they are sold. The design or production must be fixed.
☛ Old-age wear out: Most products wear out over time, and many fail at about the same age. This should be disclosed to customers.
Binomial Distributions
☛ Binomial distributions are models for some categorical variables, typically representing the number of successes in a series of n trials.
☛ The observations must meet these requirements:
☞ The total number of observations n is fixed in advance.
☞ Each observation falls into 1 of 2 categories: success and failure.
☞ The outcomes of all n observations are statistically independent.
☞ All n observations have the same probability of “success,” p.
☛ Example: I have 10 coins to toss at the same time. What is the distribution of the number X of heads? Here n = 10; the two categories are head (success) and tail (failure); the outcome of each toss is independent; p = 0.5.
Binomial Distributions for Sample Counts
☛ Express a binomial distribution for the count X of successes among n observations as a function of the parameters n and p: B(n, p), where p is the probability of success on each observation.
☛ Example 1: coin tossing. The binomial distribution for the count X of heads is B(10, 1/2).
☛ Example 2: record the next 50 births at a local hospital. Each newborn is either a boy or a girl; each baby is either born on a Sunday or not.
☞ What is the binomial distribution for the count of boys?
☞ What is the binomial distribution for the count of babies born on a Sunday?
Binomial Distribution in Statistical Sampling
☛ Choosing a simple random sample (SRS) from any population is not quite a binomial setting. However, when the population is large, removing a few items has a very small effect on the composition of the remaining population: successive observations are very nearly independent.
☛ A population contains a proportion p of successes. If the population is much larger than the sample, the count X of successes in an SRS of size n has approximately the binomial distribution B(n, p).
☛ The n observations will be nearly independent when the size of the population is much larger than the size of the sample. As a rule of thumb, the binomial sampling distribution for counts can be used when the population is at least 20 times as large as the sample.
Calculations for Binomial Probabilities
☛ The binomial probability P(X = k) is the binomial coefficient multiplied by the probability of any specific arrangement of the k successes:
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}.$$
☛ The probability that a binomial random variable takes any range of values is the sum of the probabilities of getting exactly each of those numbers of successes in n observations.
Example:
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2).
Example: Color Blindness
☛ The frequency of color blindness (dyschromatopsia) in the Caucasian American male population is estimated to be about 8%. We take a random sample of size 25 from this population. The population is definitely larger than 20 times the sample size, thus we can approximate the sampling distribution by B(n = 25, p = 0.08).
☛ What is the probability that exactly five will be color blind?
$$P(X = 5) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{25!}{5!\,20!} \, 0.08^5 (1 - 0.08)^{20} = 0.0329.$$
☛ What is the probability that five individuals or fewer in the sample are color blind?
☛ What is the probability that more than five will be color blind?
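All three questions reduce to sums of binomial probabilities. The sketch below implements the binomial formula with `math.comb`; the helper name `binom_pmf` is an illustrative choice, not something defined in the slides.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): binomial coefficient times p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 25, 0.08

p_exactly_5 = binom_pmf(5, n, p)
# "Five or fewer" is the sum over k = 0, 1, ..., 5
p_at_most_5 = sum(binom_pmf(k, n, p) for k in range(6))
# "More than five" is the complement
p_more_than_5 = 1 - p_at_most_5

print(f"P(X = 5)  = {p_exactly_5:.4f}")
print(f"P(X <= 5) = {p_at_most_5:.4f}")
print(f"P(X > 5)  = {p_more_than_5:.4f}")
```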
Binomial Mean and Standard Deviation
☛ The center and spread of the binomial distribution for a count X are defined by the mean µ and standard deviation σ:
$$\mu = np; \quad \sigma = \sqrt{np(1-p)} = \sqrt{npq}, \quad \text{where } q = 1 - p.$$
☛ Example: the effect of changing p when n is fixed at n = 10: p = 0.25, p = 0.5, and p = 0.75.
☛ For small samples, binomial distributions are skewed when p is different from 0.5.
Example — Binomial Mean and Standard Deviation
☛ The mean and standard deviation of the count of color blind individuals in the SRS of 25 Caucasian American males:
$$\mu = np = 25 \times 0.08 = 2; \quad \sigma = \sqrt{np(1-p)} = \sqrt{25 \times 0.08 \times 0.92} = 1.36.$$
☛ When size is 10, µ = 0.8; σ = 0.86.
☛ When size is 75, µ = 6; σ = 2.35.
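The formulas µ = np and σ = √(np(1 − p)) are easy to tabulate for several sample sizes. This is a minimal sketch; `binom_mean_sd` is a hypothetical helper name.

```python
import math


def binom_mean_sd(n, p):
    """Mean and standard deviation of a B(n, p) count."""
    return n * p, math.sqrt(n * p * (1 - p))


p = 0.08   # frequency of color blindness from the example
results = {n: binom_mean_sd(n, p) for n in (10, 25, 75)}

for n, (mu, sd) in results.items():
    print(f"n = {n:3d}: mean = {mu:.2f}, sd = {sd:.2f}")
```

Note that the mean grows in proportion to n, while the standard deviation grows only in proportion to √n.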
Sample Proportions
☛ The proportion of “successes” can be more informative than the count. In statistical sampling the sample proportion of successes, $\hat{p}$, is used to estimate the proportion p of successes in a population. For any SRS of size n, the sample proportion of successes is
$$\hat{p} = \frac{\text{count of successes in the sample}}{n} = \frac{X}{n}.$$
☛ If the sample size is much smaller than the size of a population with proportion p of successes, then the mean and standard deviation of $\hat{p}$ are
$$\mu_{\hat{p}} = p; \quad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}.$$
☞ The sample proportion in an SRS is an unbiased estimator of the population proportion p.
☞ The variability decreases as the sample size increases. So larger samples usually give closer estimates of the population proportion p.
Normal Approximation
☛ If n is large, and p is not too close to 0 or 1, the binomial distribution can be approximated by the normal distribution
$$N(\mu = np, \; \sigma^2 = np(1-p)).$$
Practically, the Normal approximation can be used when both np ≥ 10 and n(1 − p) ≥ 10.
☛ If X is the count of successes in the sample and $\hat{p} = X/n$ is the sample proportion of successes, their sampling distributions for large n are:
☞ X is approximately $N(\mu = np, \; \sigma^2 = np(1-p))$.
☞ $\hat{p}$ is approximately $N(\mu = p, \; \sigma^2 = p(1-p)/n)$.
Normal Approximation — Cont’d
☛ The sampling distribution of $\hat{p}$ is never exactly normal. But as the sample size increases, the sampling distribution of $\hat{p}$ becomes approximately normal. The normal approximation is most accurate for any fixed n when p is close to 0.5, and least accurate when p is near 0 or near 1.
Normal Approximation — Continuity Correction
☛ Assume the frequency of color blindness in the Caucasian American male population is about 8%. Take a random sample of size 125 from this population. What is the probability that six individuals or fewer in the sample are color blind?
☞ Sampling distribution of the count X is binomial: B(n = 125, p = 0.08), so P(X ≤ 6) = 0.1198.
☞ Normal approximation for the count X: $N(np, \sqrt{np(1-p)})$, or N(10, 3.033), so
$$z = \frac{x - \mu}{\sigma} = \frac{6 - 10}{3.033} = -1.32 \implies P(X \le 6) = 0.0934.$$
Normal Approximation — Continuity Correction — Cont’d
☛ A binomial random variable is a discrete variable that can only take whole numerical values.
☛ In contrast, a normal random variable is a continuous variable that can take any numerical value.
☛ The normal distribution is a better approximation of the binomial distribution with a continuity correction:
☞ variable x′ = x + 0.5 is substituted for x,
☞ and P(X ≤ x) is replaced by P(X ≤ x + 0.5).
☞ In this example, P(X ≤ 6.5) = 0.1243, which approximates the binomial probability better than the Normal distribution without continuity correction does.
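The three numbers in this example (exact binomial, plain Normal approximation, and Normal approximation with continuity correction) can be compared side by side. The sketch below uses only the standard library; variable names are illustrative, and the unrounded z-scores give values that may differ from table lookups in the last digit.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 125, 0.08

# Exact binomial probability P(X <= 6)
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(7))

# Normal approximation N(np, sqrt(np(1-p)))
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))
plain = approx.cdf(6)        # no continuity correction
corrected = approx.cdf(6.5)  # continuity correction: x' = x + 0.5

print(f"exact binomial:        {exact:.4f}")
print(f"normal, no correction: {plain:.4f}")
print(f"normal, corrected:     {corrected:.4f}")
```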
Objectives
☛ Estimating with confidence: Statistical confidence; Confidence intervals; Confidence interval for a population mean; How confidence intervals behave; Choosing the sample size
☛ Tests of significance: The reasoning of significance tests; Stating hypotheses; The P-value; Statistical significance; Tests for a population mean; Confidence intervals to test hypotheses
☛ Use and abuse of tests
☛ Power and inference as a decision: Cautions about significance tests; Power of a test; Type I and II errors; Error probabilities
Overview of Inference
☛ Methods for drawing conclusions about a population from sample data are called statistical inference.
☛ Methods
☞ Confidence intervals - estimating the value of a population parameter. A range of values with an associated confidence level C.
☞ Tests of significance - assessing evidence for a claim about a population. Significance level: the largest P-value tolerated for rejecting a true null hypothesis.
☛ Inference is appropriate when data are produced by either
☞ a random sample or
☞ a randomized experiment
Statistical Confidence
☛ Although the sample mean $\bar{x}$ is a unique number for any particular sample, if you pick a different sample you will probably get a different sample mean.
☛ In fact, there are many different values for the sample mean, and virtually none of them would actually equal the true population mean µ.
☛ But the sampling distribution is narrower than the population distribution, by a factor of $\sqrt{n}$.
☛ Thus the estimates gained from our samples are usually relatively close to the population parameter µ.
Discussion: How to Obtain Population Parameters from Sample Statistics
☛ Number of samples:
☞ Do we need multiple samples?
☞ How about only one sample?
☛ Type of approximation:
☞ Point estimates are the single, most likely value of a parameter. For example, the point estimate of the population mean (the parameter) is the sample mean (the parameter estimate).
☞ Confidence intervals are a range of values likely to contain the population parameter.
The Essence of Statistical Inference
☛ If we know the population parameter µ: 95% of all sample means will be within roughly 2 standard deviations ($2\sigma/\sqrt{n}$) of the population parameter µ.
☛ If we DO NOT know the population parameter µ: Distances are symmetrical, which implies that the population parameter µ must be within roughly 2 standard deviations of the sample average $\bar{x}$ in 95% of all samples.
Example: Statistical Confidence
☛ The weight of single eggs of the brown variety is normally distributed N(65 g, 5 g). Think of a carton of 12 brown eggs as an SRS of size 12.
☞ The distribution of the sample means $\bar{x}$ is $N(\mu, \sigma/\sqrt{n})$ = N(65 g, 1.44 g).
☞ The middle 95% of the sample means distribution is roughly $\pm 2\sigma/\sqrt{n}$ from the mean, or 65 g ± 2.88 g.
☛ You buy a carton of 12 white eggs instead. The box weighs 770 g. The average egg weight from that SRS is thus $\bar{x}$ = 64.2 g.
☞ Knowing that the standard deviation of egg weight is 5 g, what can you infer about the mean µ of the white egg population? We are 95% confident that the population mean µ is within 64.2 g ± 2.88 g, or roughly within $\pm 2\sigma/\sqrt{n}$ of $\bar{x}$.
Confidence Intervals
☛ The confidence interval is a range of values with an associated confidence level C. Confidence intervals for means are intervals constructed using a procedure that will contain the population mean a specified proportion (C) of the time.
☛ $\bar{x} \pm 4.2$ is a 95% confidence interval for the population parameter µ. This expression says that in 95% of the cases, the actual value of µ will be within 4.2 units of the value of $\bar{x}$.
Implications
☛ We don’t need to take a lot of random samples to “rebuild” the sampling distribution and find µ at its center.
☛ All we need is one SRS of size n, relying on the properties of the sample means distribution to infer the population mean µ.
Confidence Intervals — Cont’d
With 95% confidence, we can say that µ should be within roughly 2 standard deviations ($2\sigma/\sqrt{n}$) from our sample mean $\bar{x}$.
☛ In 95% of all possible samples of this size n, µ will indeed fall in our confidence interval.
☛ In only 5% of samples would $\bar{x}$ be farther from µ.
Understanding Confidence Intervals
☛ Can we interpret a 95% confidence interval as an interval with a 0.95 probability of containing the population mean?
☛ Question: Strictly speaking, what is the best interpretation of a 95% confidence interval for the mean?
☞ If repeated samples were taken and the 95% confidence interval was computed for each sample, 95% of the intervals would contain the population mean.
☞ A 95% confidence interval has a 0.95 probability of containing the population mean.
☞ 95% of the population distribution is contained in the confidence interval.
Confidence Intervals — Cont’d
☛ A confidence interval can be expressed as:
☞ Mean ± m, where m is called the margin of error, i.e., µ is within $\bar{x} \pm m$. Example: 120 ± 6
☞ Two endpoints of an interval: µ is within $(\bar{x} - m)$ to $(\bar{x} + m)$. Example: 114 to 126.
☛ A confidence level C (in %) indicates that if we were to repeat the whole experiment N times under the same conditions, we would have N different confidence intervals. The confidence level is the proportion of these intervals that contain the true mean of the population.
☛ It represents the area under the normal curve within ±m of the center of the curve.
Review: Standardizing the Normal Curve Using z
☛ σ is the standard deviation of the original population.
☛ Here, we work with the sampling distribution, and $\sigma/\sqrt{n}$ is its standard deviation (spread).
Varying Confidence Levels
☛ Confidence intervals contain the population mean µ in C% of samples. Different areas under the curve give different confidence levels C.
☛ Practical use of z: z*
☞ z* is related to the chosen confidence level C.
☞ C is the area under the standard normal curve between −z* and z*.
☛ The margin of error and confidence interval are thus
$$z^* \frac{\sigma}{\sqrt{n}} \quad \text{and} \quad \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}.$$
☛ Example: For an 80% confidence level C, 80% of the normal curve’s area is contained in the interval.
☛ Use Table D to find specific z* values.
Link Between Confidence Level and Margin of Error
☛ The confidence level C determines the value of z*. The margin of error also depends on z*.
☛ Higher confidence C implies a larger margin of error m (thus less precision in our estimates).
☛ A lower confidence level C produces a smaller margin of error m (thus better precision in our estimates).
Example: Different Confidence Intervals for the Same Set of Measurements
☛ Density of bacteria in solution: Measurement equipment has standard deviation $\sigma = 1 \times 10^6$ bacteria/ml fluid. Three measurements: 24, 29, and $31 \times 10^6$ bacteria/ml fluid. The mean is $\bar{x} = 28 \times 10^6$ bacteria/ml. Find the 96% and 70% CI.
☛ 96% confidence interval for the true density: z* = 2.054, and write
$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} = 28 \pm 2.054 \times \frac{1}{\sqrt{3}} = (28 \pm 1.19) \times 10^6 \text{ bacteria/ml}.$$
☛ 70% confidence interval for the true density: z* = 1.036, and write
$$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} = 28 \pm 1.036 \times \frac{1}{\sqrt{3}} = (28 \pm 0.60) \times 10^6 \text{ bacteria/ml}.$$
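Instead of reading z* from a table, it can be obtained from the inverse Normal CDF: for confidence level C, z* is the (1 + C)/2 quantile of the standard Normal. The sketch below assumes a known σ, as in the example; `z_interval` is a hypothetical helper name, and the measurements are expressed in units of 10^6 bacteria/ml.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_interval(data, sigma, conf):
    """Confidence interval x_bar +/- z* sigma/sqrt(n) for a known sigma."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)
    x_bar = fmean(data)
    margin = z_star * sigma / sqrt(len(data))
    return x_bar - margin, x_bar + margin


measurements = [24, 29, 31]   # in units of 10^6 bacteria/ml
sigma = 1                     # known measurement sd, same units

for conf in (0.96, 0.70):
    low, high = z_interval(measurements, sigma, conf)
    print(f"{conf:.0%} CI: {low:.2f} to {high:.2f} (x 10^6 bacteria/ml)")
```

The 70% interval is narrower than the 96% interval for the same data: lower confidence buys a smaller margin of error.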
Properties of Confidence Intervals
☛ The user chooses the confidence level; the margin of error follows from this choice.
☛ The margin of error, $z^* \sigma/\sqrt{n}$, gets smaller when z* (and thus the confidence level C) gets smaller, σ is smaller, or n is larger.
☛ The spread in the sampling distribution of the mean is a function of the number of individuals per sample.
☞ The larger the sample size, the smaller the standard deviation (spread) of the sample mean distribution.
☞ But the spread only decreases at a rate equal to $\sqrt{n}$.
Sample Size and Experimental Design
☛ You may need a certain margin of error (e.g., drug trial, manufacturing specs). In many cases, the population variability (σ) is fixed, but we can choose the number of measurements (n). So plan ahead what sample size to use to achieve that margin of error.
$$m = z^* \frac{\sigma}{\sqrt{n}} \iff n = \left(\frac{z^* \sigma}{m}\right)^2.$$
☛ Example: Measurement equipment has standard deviation $\sigma = 1 \times 10^6$ bacteria/ml fluid. How many measurements should you make to obtain a margin of error of at most $0.5 \times 10^6$ bacteria/ml with a confidence level of 95%? For a 95% confidence interval, z* = 1.96.
$$n = \left(\frac{z^* \sigma}{m}\right)^2 = \left(\frac{1.96 \times 1}{0.5}\right)^2 = 15.3664.$$
Therefore, we need at least 16 measurements.
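The sample-size formula, including the final rounding up, can be sketched as a small function. `required_n` is a hypothetical helper name; σ and the margin are in units of 10^6 bacteria/ml.

```python
import math
from statistics import NormalDist


def required_n(sigma, margin, conf):
    """Smallest n with z* sigma / sqrt(n) <= margin, for a known sigma."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)
    # Always round up: rounding down would give a margin larger than requested
    return math.ceil((z_star * sigma / margin) ** 2)


n_needed = required_n(sigma=1.0, margin=0.5, conf=0.95)
print(n_needed)  # 16
```

Halving the margin of error to 0.25 would roughly quadruple the required number of measurements, since n grows with the square of 1/m.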
Interpretation of Confidence Intervals
☛ Conditions under which an inference method is valid are never fully met in practice. Exploratory data analysis and judgment should be used when deciding whether or not to use a statistical procedure.
☛ Any individual confidence interval either will or will not contain the true population mean. It is wrong to say that the probability is 95% that the true mean falls in the confidence interval.
☛ The correct interpretation of a 95% confidence interval is that we are 95% confident that the true mean falls within the interval. The confidence interval was calculated by a method that gives correct results in 95% of all possible samples. In other words, if many such confidence intervals were constructed, 95% of these intervals would contain the true mean.
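The "95% of all possible samples" interpretation can be illustrated by simulation. The sketch below is built on assumed settings: a Normal population with µ = 65 and σ = 5 (borrowed from the egg example), samples of size 12, and an arbitrary seed and number of trials. It counts how often the interval $\bar{x} \pm z^* \sigma/\sqrt{n}$ captures the true mean.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(1)            # fixed seed, chosen arbitrarily
mu, sigma, n = 65.0, 5.0, 12      # assumed population and sample size
trials = 10_000

z_star = NormalDist().inv_cdf(0.975)   # z* for 95% confidence
margin = z_star * sigma / sqrt(n)

hits = 0
for _ in range(trials):
    x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
    # Each sample gives a different interval; check whether it contains mu
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(f"fraction of intervals containing mu: {coverage:.3f}")  # near 0.95
```

Each individual interval either contains µ or it does not; the 95% describes the long-run success rate of the method.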
Reasoning of Significance Tests
☛ We have seen that the properties of the sampling distribution of $\bar{x}$ help us estimate a range of likely values for the population mean µ.
☛ We can also rely on the properties of the sampling distribution to test hypotheses.
☛ Example: You are in charge of quality control in your food company. You randomly sample four packs of cherry tomatoes, each labeled 0.5 lb. (227 g). The average weight from your four boxes is 222 g. Obviously, we cannot expect boxes filled with whole tomatoes to all weigh exactly half a pound. Thus,
☞ Is the somewhat smaller weight simply due to chance variation?
☞ Is it evidence that the calibrating machine that sorts cherry tomatoes into packs needs revision?
Hypotheses
☛ A test of statistical significance tests a specific hypothesis using sample data to decide on the validity of the hypothesis.
☛ In statistics, a hypothesis is an assumption or a theory about the characteristics of one or more variables in one or more populations.
☛ In the example above,
☞ What you want to know: does the calibrating machine that sorts cherry tomatoes into packs need revision?
☞ The same question reframed statistically: is the population mean µ for the distribution of weights of cherry tomato packages equal to 227 g (i.e., half a pound)?
☛ Another example: Assume the USM average GPA is 2.5. We found that the average GPA of our class is 3.9. Is this normal? In other words, given that our mean GPA is 3.9, can we conclude that the USM mean GPA is 2.5?
Stating Hypotheses
☛ The null hypothesis is a very specific statement about a parameter of the population(s). It is labeled H0.
☛ The alternative hypothesis is a more general statement about a parameter of the population(s) that is exclusive of the null hypothesis. It is labeled Ha.
☛ Example: weight of cherry tomato packs:
☞ H0 : µ = 227 g (µ is the average weight of the population of packs)
☞ Ha : µ ≠ 227 g (µ is either larger or smaller).
☛ Example: USM GPA:
☞ H0 : µ = 2.5 (µ is the mean GPA of the USM population)
☞ Ha : µ > 2.5 (µ is larger).
One-Sided and Two-Sided Tests
☛ A two-tail or two-sided test of the population mean has these null and alternative hypotheses:
H0 : µ = [a specific number]; Ha : µ ≠ [a specific number]
☛ A one-tail or one-sided test of a population mean has these null and alternative hypotheses:
H0 : µ = [a specific number]; Ha : µ < [a specific number]
or
H0 : µ = [a specific number]; Ha : µ > [a specific number]
☛ Example: The FDA tests whether a generic drug has an absorption extent similar to the known absorption extent of the brand-name drug it is copying. Higher or lower absorption would both be problematic, thus we test:
H0 : $\mu_{\text{generic}} = \mu_{\text{brand}}$; Ha : $\mu_{\text{generic}} \neq \mu_{\text{brand}}$.
The P-Value
☛ Example: The packaging process has a known standard deviation σ = 5 g. The null and alternative hypotheses are: H0 : µ = 227 g; Ha : µ ≠ 227 g.
☛ The average weight from your four random boxes is 222 g. What is the probability of drawing a random sample such as yours if H0 is true (which means the mean of the population is indeed 227 g)?
☛ Tests of statistical significance quantify the chance of obtaining a particular random sample result if the null hypothesis were true. This quantity is the P-value.
☛ This is a way of assessing the “believability” of the null hypothesis, given the evidence provided by a random sample.
☛ Example: the odds of winning the Powerball grand prize are 1 in 292,201,338. A random person (you) won the jackpot. What is H0? What is Ha? What is the p-value?
The P-Value: Rejecting the Null Hypothesis
☛ We know that the null hypothesis and the sample statistic do not match all the time.
☛ You ask: could random variation alone account for the difference between the null hypothesis and observations from a random sample?
☛ A small P-value implies that random variation due to the sampling process alone is not likely to account for the observed difference.
☛ With a small p-value we reject H0, which means that the true property of the population is significantly different from what was stated in H0.
☛ Thus, small P-values are strong evidence AGAINST H0.
☛ In the powerball example, what is your conclusion?
Significant P-Value
☛ When the shaded area is small, the probability of drawing such a sample at random is very slim.
☛ Oftentimes, a P-value of 0.05 or less is considered significant: the phenomenon observed is unlikely to be entirely due to chance in the random sampling.
Testing of the Null Hypothesis
☛ To test the hypothesis H0 : µ = µ0 based on an SRS of size n from a Normal population with unknown mean µ and known standard deviation σ, we rely on the properties of the sampling distribution $N(\mu, \sigma/\sqrt{n})$.
☛ The P-value is the area under the sampling distribution for values at least as extreme, in the direction of Ha, as that of our random sample.
☛ z-score: $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
☛ The p-value of a one-sided test for Ha : µ > µ0 is P(Z ≥ z), and for Ha : µ < µ0 it is P(Z ≤ z).
☛ The p-value of a two-sided test for Ha : µ ≠ µ0 is 2P(Z ≥ |z|).
Example: Does the Packaging Machine Need Revision?
☛ H0 : µ = 227 g versus Ha : µ ≠ 227 g. What is the probability of drawing a random sample such as yours if H0 is true?
☛ $\bar{x}$ = 222 g; σ = 5 g; n = 4, then $z = \dfrac{222 - 227}{5/\sqrt{4}} = -2$.
☛ P-value of the two-sided test = 2 × P(Z ≥ |−2|) = 2 × 0.0228 = 4.56%.
☛ The probability of getting a random sample average so different from µ is so low that we reject H0. The machine does need recalibration.
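The z test for a mean with known σ can be sketched as a short function covering the one-sided and two-sided cases. `z_test` and its `alternative` argument are hypothetical names chosen for this example; the numbers are those of the cherry tomato packs.

```python
from math import sqrt
from statistics import NormalDist


def z_test(x_bar, mu0, sigma, n, alternative="two-sided"):
    """z statistic and P-value for H0: mu = mu0 with known sigma."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":      # Ha: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":       # Ha: mu < mu0
        p_value = std_normal.cdf(z)
    else:                             # Ha: mu != mu0, both tails
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value


z, p_value = z_test(x_bar=222, mu0=227, sigma=5, n=4)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")  # z = -2.00, P about 0.0455
```

The unrounded P-value is about 4.55%; the 4.56% on the slide comes from doubling the table value 0.0228.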
Summary: Steps for Tests of Significance
☛ State the null hypothesis H0 and the alternative hypothesis Ha.
☞ H0 represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved.
☞ The alternative hypothesis, Ha, is a statement of what a statistical hypothesis test is set up to establish.
☛ Calculate the value of the test statistic.
☛ Determine the P-value for the observed data.
☛ State a conclusion.
☞ We either “reject H0 in favor of Ha” or “do not reject H0”.
☞ We never conclude “reject Ha”, or even “accept Ha”.
Understanding the P-Value
Reference from https://en.wikipedia.org/wiki/P-value
☛ The p-value is not the probability that the null hypothesis is true or the probability that the alternative hypothesis is false. It is not connected to either.
☛ The p-value is not the probability that a finding is “merely a fluke.”
☛ The p-value is not the probability of falsely rejecting the null hypothesis.
☛ The p-value is not the probability that replicating the experiment would yield the same conclusion.
☛ The significance level, such as 0.05, is not determined by the p-value.
☛ The p-value does not indicate the size or importance of the observed effect.
☛ The concept of p-value is far from perfect.
The Significance Level: α
☛ The significance level, α, is the largest P-value tolerated for rejecting a true null hypothesis (how much evidence against H0 we require).
☛ This value is decided arbitrarily before conducting the test.
☞ If the P-value is equal to or less than α (P ≤ α), then we reject H0.
☞ If the P-value is greater than α (P > α), then we fail to reject H0.
☛ Example: Does the packaging machine need revision? Answer: two-sided test; the P-value is 4.56%.
☞ If α had been set to 5%, then the P-value would be significant.
☞ If α had been set to 1%, then the P-value would not be significant.
The Significance Level — Cont’d
When the z score falls within the rejection region (shaded area on the tail side), the p-value is smaller than α and you have shown statistical significance.
Rejection Region for a Two-Tail Test of µ with α = 0.05
A two-sided test means that α is split equally between both tails of the curve, leaving a middle area C of 1 − α = 95% and an area of α/2 = 0.025 in each tail.
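As an illustration, the boundary of the two-tail rejection region can be computed instead of read from a table. The sketch below uses only the standard library; the helper names are hypothetical, and the z score of −2 is taken from the packaging machine example.

```python
from statistics import NormalDist

def two_sided_critical_value(alpha):
    """z* such that the area in each tail beyond +/- z* equals alpha / 2."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def in_rejection_region(z, alpha):
    """True when |z| lies at or beyond the two-sided critical value."""
    return abs(z) >= two_sided_critical_value(alpha)

z_star = two_sided_critical_value(0.05)
print(f"z* = {z_star:.3f}")              # about 1.960
print(in_rejection_region(-2.0, 0.05))   # True: significant at the 5% level
print(in_rejection_region(-2.0, 0.01))   # False: not significant at the 1% level
```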
Confidence Intervals to Test Hypotheses
☛ Because a two-sided test is symmetrical, we can also use a confidence interval to test a two-sided hypothesis.
☛ In a two-sided test, C = 1 − α, where C is the confidence level and α is the significance level.
☛ Example: Packs of cherry tomatoes (σ = 5 g)
H0 : µ = 227 g versus Ha : µ ≠ 227 g.
Sample average 222 g. 95% CI for µ: 222 ± 1.96 × 5/√4 = 222 g ± 4.9 g. The value 227 g does not belong to the 95% CI (217.1 to 226.9 g). Thus, we reject H0.
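The same decision can be sketched in code: build the level C interval and check whether the hypothesized mean falls inside it. The function name `z_confidence_interval` is an assumption made for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Level-C confidence interval for mu when sigma is known."""
    alpha = 1 - confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_star * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Cherry tomato example: sample mean 222 g, sigma = 5 g, n = 4.
low, high = z_confidence_interval(x_bar=222, sigma=5, n=4)
mu0 = 227
reject_h0 = not (low <= mu0 <= high)
print(f"95% CI: {low:.1f} to {high:.1f} g, reject H0: {reject_h0}")
```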
Logic of Confidence Interval Test
☛ A confidence interval gives a black and white answer: reject or don’t reject H0. But it also estimates a range of likely values for the true population mean µ.
☛ A P-value quantifies how strong the evidence is against H0. But if you reject H0, it doesn’t provide any information about the true population mean µ.
☛ Example: a sample gives a 99% confidence interval of x̄ ± m = 0.84 ± 0.0101. With 99% confidence, could the sample be from a population with µ = 0.86? With µ = 0.85?
Choosing the Significance Level α
☛ Factors often considered:
☞ What are the consequences of rejecting the null hypothesis (e.g., global warming, convicting a person for life with DNA evidence)?
☞ Are you conducting a preliminary study? If so, you may want a larger α so that you will be less likely to miss an interesting result.
☛ Some conventions:
☞ We typically use the standards of our field of work.
☞ There are no “sharp” cutoffs: e.g., 4.9% versus 5.1%.
☞ It is the order of magnitude of the P-value that matters: “somewhat significant,” “significant,” or “very significant.”
Practical Significance
☛ Statistical significance only says whether the effect observed is likely to be due to chance alone because of random sampling.
☛ Statistical significance may not be practically important. That is because statistical significance doesn’t tell you about the magnitude of the effect, only that there is one.
☛ An effect could be too small to be relevant. And with a large enough sample size, significance can be reached even for the tiniest effect.
☞ Example: a drug to lower temperature is found to reproducibly lower patient temperature by 0.4 °C (P-value < 0.01). But clinical benefits of temperature reduction only appear for a decrease of 1 °C or larger.
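To see how sample size alone can drive significance, the sketch below holds a small effect fixed and lets n grow. The effect size (0.1) and σ (1.0) are hypothetical numbers chosen for illustration, and a one-sided z test with known σ is assumed.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(effect, sigma, n):
    """P-value of a one-sided z test when the observed mean exceeds mu0 by `effect`."""
    z = effect / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z)

effect, sigma = 0.1, 1.0            # hypothetical small effect and known sigma
sample_sizes = [10, 100, 1000, 10000]
p_values = [one_sided_p_value(effect, sigma, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n = {n:>5}: p = {p:.4f}")
```

The observed effect never changes, yet the p-value shrinks steadily as n grows, which is why significance by itself says nothing about practical importance.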
Don’t Ignore Lack of Significance
☛ Failing to find statistical significance means only that we do not reject the null hypothesis. This is very different from actually accepting it. The sample size, for instance, could be too small to overcome large variability in the population.
☛ When comparing two populations, lack of significance does not imply that the two samples come from the same population. They could represent two very distinct populations with similar mathematical properties.
☞ Consider this provocative title from the British Medical Journal: “Absence of evidence is not evidence of absence.”
☞ Having no proof of who committed a murder does not imply that the murder was not committed.
Interpreting Effect Size: It’s All About Context
☛ There is no consensus on how big an effect has to be in order to be considered meaningful. In some cases, effects that may appear to be trivial can be very important.
☞ Example: Improving the format of a computerized test reduces the average response time by about 2 seconds. Although this effect is small, it is important since this is done millions of times a year. The cumulative time savings from using the better format are gigantic.
☛ Always think about the context. Try to plot your results, and compare them with a baseline or results from similar studies.
Objectives
☛ Inference for the mean of a population
The t distributions; The one-sample t confidence interval; The one-sample t test; Matched pairs t procedures; Robustness
☛ Comparing two means
Two-sample z statistic; Two-sample t procedures; Two-sample t significance test; Two-sample t confidence interval; Robustness
An Example: Sweetening Colas
☛ Cola manufacturers want to test how much the sweetness of a new cola drink is affected by storage. The sweetness loss due to storage was evaluated by 10 professional tasters (by comparing the sweetness before and after storage):
Taster Sweetness loss
1 2.0
2 0.4
3 0.7
4 2.0
5 −0.4
6 2.2
· · · · · ·
☛ Obviously, we want to test if storage results in a loss of sweetness, thus: H0 : µ = 0 versus Ha : µ > 0
☛ This looks familiar. However, here we do not know the population parameter σ.
☞ The population of all cola drinkers is too large.
☞ Since this is a new cola recipe, we have no population data.
☛ This situation is very common with real data.
When σ is Unknown
☛ The sample standard deviation s provides an estimate of the population standard deviation σ.
☛ When the sample size is large, the sample is likely to contain elements representative of the whole population. Then s is a good estimate of σ.
☛ But when the sample size is small, the sample contains only a few individuals. Then s is a mediocre estimate of σ.
Standard Deviation s —Standard Error s/√n
☛ For a sample of size n, the sample standard deviation s is:
$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}, $$
where n − 1 is the “degrees of freedom.”
☛ The value s/√n is called the standard error of the mean (SEM). Scientists often present sample results as mean ± SEM.
☛ Example: A study examined the effect of a new medication on the seated systolic blood pressure. The results, presented as mean ± SEM for 25 patients, are 113.5 ± 8.9. What is the standard deviation s of the sample data?
Solution: SEM = s/√n, so s = 8.9 × √25 = 44.5.
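The definitions above translate directly into code. In this sketch the `readings` list is hypothetical data invented for illustration; only the values 8.9 and n = 25 come from the blood pressure example.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(xs):
    """Sample standard deviation with n - 1 degrees of freedom."""
    x_bar = mean(xs)
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1))

def standard_error(xs):
    """Standard error of the mean: s / sqrt(n)."""
    return sample_sd(xs) / sqrt(len(xs))

def sd_from_sem(sem, n):
    """Recover s from a reported SEM and the sample size n."""
    return sem * sqrt(n)

readings = [112.0, 118.5, 109.0, 121.5, 115.0]    # hypothetical values
print(sample_sd(readings), stdev(readings))       # the two should agree
print(standard_error(readings))
print(sd_from_sem(8.9, 25))                       # blood pressure example
```

The standard library's `statistics.stdev` uses the same n − 1 divisor, so it serves as a check on the hand-written formula.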
The t Distributions
☛ Suppose that an SRS of size n is drawn from an N(µ, σ) population.
☛ When σ is known, the sampling distribution of x̄ is N(µ, σ/√n).
☛ When σ is estimated by the sample standard deviation s, the sampling distribution follows a t distribution t(µ, s/√n) with n − 1 degrees of freedom. The one-sample t statistic is
$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}}. $$
☛ When n is very large, s is a very good estimate of σ, and the corresponding t distributions are very close to the normal distribution.
☛ The t distributions become wider for smaller sample sizes, reflecting the lack of precision in estimating σ from s.
Standardizing t Distribution
☛ As with the normal distribution, the first step is to standardize the data. Then we can use Table D to obtain the area under the curve.
☛ Here, µ is the mean (center) of the sampling distribution, and the standard error of the mean s/√n is its standard deviation (width). You obtain s, the standard deviation of the sample, with your calculator.
Table A and Table D
The One-Sample t-Confidence Interval
☛ The level C confidence interval is an interval with confidence C of containing the true population parameter.
☛ We have a data set from a population with both µ and σ unknown. We use x̄ to estimate µ and s to estimate σ, using a t distribution (df = n − 1).
☛ Practical use of t : t∗
☞ C is the area between −t∗ and t∗.
☞ We find t∗ in the line of Table D for df = n − 1 and confidence level C.
☞ The margin of error m is: m = t∗ s/√n.
Example: Red Wine
☛ To see if moderate red wine consumption increases the average blood level of polyphenols, a group of nine randomly selected healthy men were assigned to drink half a bottle of red wine daily for two weeks. Their blood polyphenol levels were assessed before and after the study, and the percent changes are: 0.7; 3.5; 4; 4.9; 5.5; 7; 7.4; 8.1; 8.4.
☛ Are the data approximately normal? Yes, there is a low value, but overall the data can be considered reasonably normal.
Example: Red Wine — Cont’d
☛ What is the 95% confidence interval for the average percent change? Sample average = 5.5; s = 2.517; df = n − 1 = 8.
The sampling distribution is a t distribution with n − 1 degrees of freedom. For df = 8 and C = 95%, t∗ = 2.306, so the margin of error m is: m = t∗ s/√n = 2.306 × 2.517/√9 ≈ 1.9.
☛ With 95% confidence, the population average percent increase in polyphenol blood levels of healthy men drinking half a bottle of red wine daily is between 3.6% and 7.4%.
Important: The confidence interval shows how large the increase is, but not whether it can have an impact on men’s health.
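The red wine interval can be recomputed from the nine percent changes listed in the example. The sketch below takes t∗ = 2.306 as given by Table D, since the Python standard library has no t quantile function; the function name is an illustrative choice.

```python
from math import sqrt
from statistics import mean, stdev

# Percent changes in blood polyphenol level for the nine men.
changes = [0.7, 3.5, 4, 4.9, 5.5, 7, 7.4, 8.1, 8.4]
T_STAR_DF8_95 = 2.306   # read from Table D: df = 8, C = 95%

def t_confidence_interval(xs, t_star):
    """Return (mean, s, margin, (low, high)) for a one-sample t interval."""
    n = len(xs)
    x_bar = mean(xs)
    s = stdev(xs)                    # uses n - 1 degrees of freedom
    margin = t_star * s / sqrt(n)
    return x_bar, s, margin, (x_bar - margin, x_bar + margin)

x_bar, s, margin, (low, high) = t_confidence_interval(changes, T_STAR_DF8_95)
print(f"mean = {x_bar:.1f}, s = {s:.3f}, m = {margin:.1f}")
print(f"95% CI: {low:.1f}% to {high:.1f}%")
```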
The One-Sample t-Test
As in the previous chapter, a test of hypotheses requires a few steps:
☛ Stating the null and alternative hypotheses (H0 versus Ha)
☛ Deciding on a one-sided or two-sided test
☛ Choosing a significance level α
☛ Calculating t and its degrees of freedom
☛ Finding the area under the curve with Table D
☛ Stating the p-value and interpreting the result
The One-Sample t-Test — Cont’d
☛ The p-value is the probability, if H0 is true, of randomly drawing a sample like the one obtained or more extreme, in the direction of Ha. Equivalently, it is the probability that random fluctuations alone could have generated results that differed from H0, in the direction of Ha, by at least as much as what you observed in your data.
☛ The p-value is calculated as the corresponding area under the t curve, one-tailed or two-tailed depending on Ha: P(T ≥ t) for Ha : µ > µ0, P(T ≤ t) for Ha : µ < µ0, and 2P(T ≥ |t|) for Ha : µ ≠ µ0.
Example: Sweetening Colas — Cont’d
Is there evidence that storage results in sweetness loss for the new cola recipe at the 0.05 level of significance (α = 5%)?
H0 : µ = 0 versus Ha : µ > 0 (one-sided test)
$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1.02 - 0}{1.196/\sqrt{10}} = 2.70 $$
☛ The critical value for df = 9 is tα = 1.833. Since t > tα, the result is significant.
☛ Or, from Table D, 2.398 < t = 2.70 < 2.821, thus 0.02 > p > 0.01. Since p < α, the result is significant.
☛ The t-test has a significant p-value. We reject H0. There is a significant loss of sweetness, on average, following storage.
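Table D only brackets the p-value. As a rough check, the tail area can be approximated by numerically integrating the t density. This is a sketch under stated assumptions: it uses Simpson's rule with the standard library only, the helper names are hypothetical, and statistical software would normally be used instead.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half the area lies above 0; subtract the area between 0 and t.
    return 0.5 - total * h / 3

# Cola example: summary statistics from the slide.
x_bar, mu0, s, n = 1.02, 0.0, 1.196, 10
t = (x_bar - mu0) / (s / sqrt(n))
p = t_upper_tail(t, df=n - 1)
print(f"t = {t:.2f}, one-sided p = {p:.4f}")
```

The computed p-value falls between 0.01 and 0.02, in agreement with the bracket read from Table D.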
Matched Pairs t Procedures
Sometimes we want to compare treatments or conditions at the individual level. These situations produce two samples that are not independent: they are related to each other. The members of one sample are identical to, or matched (paired) with, the members of the other sample.
☛ Example: Pre-test and post-test studies look at data collected on the same sample elements before and after some experiment is performed.
☛ Example: Twin studies often try to sort out the influence of genetic factors by comparing a variable between sets of twins.
☛ Example: Using people matched for age, sex, and education in social studies allows canceling out the effect of these potential lurking variables.
In these cases, we use the paired data to test the difference in the two population means. The variable studied becomes Xdifference = X1 − X2, and H0 : µdifference = 0; Ha : µdifference > 0 (or < 0, or ≠ 0).
Conceptually, this is not different from tests on one population.
Example: Does Lack of Caffeine Increase Depression?
☛ Individuals diagnosed as caffeine-dependent are deprived of caffeine-rich foods and assigned to receive daily pills. Sometimes the pills contain caffeine, and other times they contain a placebo. Depression was assessed.
☛ There are 2 data points for each subject, but we’ll only look at the difference.
☛ For each individual in the sample, we have calculated a difference in depression score (placebo minus caffeine). There were 11 “difference” points, thus df = n − 1 = 10. x̄ = 7.36; s = 6.92
☛ H0 : µdifference = 0; Ha : µdifference > 0.
☛ t = (x̄ − 0)/(s/√n) = 3.53
☛ For df = 10, 3.169 < t = 3.53 < 3.581 ⟹ 0.005 > p > 0.0025. Caffeine deprivation causes a significant increase in depression.
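A matched pairs test is a one-sample t test on the differences, which the sketch below makes explicit. The summary values (7.36, 6.92, n = 11) come from the example; the two short score lists are hypothetical and only show how raw pairs would be reduced to differences.

```python
from math import sqrt
from statistics import mean, stdev

def t_from_summary(mean_diff, sd_diff, n):
    """One-sample t statistic for H0: mu_difference = 0."""
    return mean_diff / (sd_diff / sqrt(n))

def paired_t(first, second):
    """Reduce paired measurements to differences, then return (t, df)."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    return t_from_summary(mean(diffs), stdev(diffs), n), n - 1

# Caffeine example, from the reported summary statistics.
t_caffeine = t_from_summary(7.36, 6.92, 11)
print(f"t = {t_caffeine:.2f} with df = 10")

# Hypothetical raw scores, only to illustrate the pairing step.
placebo = [5, 8, 6, 9]
caffeine = [3, 5, 6, 4]
t_demo, df_demo = paired_t(placebo, caffeine)
print(f"demo: t = {t_demo:.2f}, df = {df_demo}")
```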
Robustness
☛ The t procedures are exactly correct when the population is distributed exactly normally. However, most real data are not exactly normal.
☛ The t procedures are robust to small deviations from normality: the results will not be affected too much. Factors that strongly matter:
☞ Random sampling. The sample must be an SRS from the population.
☞ Outliers and skewness. They strongly influence the mean and therefore the t procedures. However, their impact diminishes as the sample size gets larger because of the Central Limit Theorem.
☛ Specifically:
☞ When n < 15, the data must be close to normal and without outliers.
☞ When 15 ≤ n ≤ 40, mild skewness is acceptable but not outliers.
☞ When n > 40, the t-statistic will be valid even with strong skewness.
Power of the t-test
☛ The power or sensitivity of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (Ha) is true.
☛ The power of the one-sample t-test for a specific alternative value of the population mean µ, assuming a fixed significance level α, is the probability that the test will reject the null hypothesis when the alternative value of the mean is true.
☛ Calculation of the exact power of the t-test is a bit complex. But an approximate calculation that acts as if σ were known is almost always adequate for planning a study. This calculation is very much like that for the z-test.
☞ When guessing σ, it is always better to err on the side of a standard deviation that is a little larger rather than smaller. We want to avoid failing to find an effect because we did not have enough data.
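A minimal sketch of the approximate power calculation follows, treating the guessed standard deviation as if it were the known σ and assuming a one-sided alternative Ha : µ > µ0. The planning values (alternative mean 0.8, guessed σ of 1.2) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_one_sided(mu0, mu_alt, sigma_guess, n, alpha=0.05):
    """Approximate power for H0: mu = mu0 vs Ha: mu > mu0, acting as if sigma is known."""
    std_normal = NormalDist()
    z_star = std_normal.inv_cdf(1 - alpha)           # reject when z >= z_star
    shift = (mu_alt - mu0) / (sigma_guess / sqrt(n)) # mean of z when mu = mu_alt
    return 1 - std_normal.cdf(z_star - shift)

# Hypothetical planning values.
power_10 = approx_power_one_sided(mu0=0, mu_alt=0.8, sigma_guess=1.2, n=10)
power_20 = approx_power_one_sided(mu0=0, mu_alt=0.8, sigma_guess=1.2, n=20)
power_big_sigma = approx_power_one_sided(mu0=0, mu_alt=0.8, sigma_guess=1.5, n=10)
print(f"n = 10: {power_10:.2f}, n = 20: {power_20:.2f}")
print(f"n = 10 with a larger sigma guess: {power_big_sigma:.2f}")
```

A larger guess for σ lowers the computed power, so it pushes the plan toward a larger sample, which is the conservative direction recommended above.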
Inference for Non-Normal Distributions
What if the population is clearly non-normal and your sample is small?
☛ If the data are skewed, you can attempt to transform the variable to bring it closer to normality (e.g., logarithm transformation). The t procedures applied to transformed data are quite accurate for even moderate sample sizes.
☛ A distribution other than a normal distribution might describe your data well. Many non-normal models have been developed to provide inference procedures too.
☛ You can always use a distribution-free (nonparametric) inference procedure that does not assume any specific distribution for the population. But it is usually less powerful than distribution-driven tests (e.g., t-test).
Nonparametric Method: the Sign Test for Median
☛ A distribution-free test usually makes a statement of hypotheses about the median rather than the mean.
☛ A simple distribution-free test is the sign test for matched pairs.
☛ Assume that our random variable X is a continuous random variable with unknown median m.
☛ Upon taking a random sample X1, X2, · · · , Xn, we’ll be interested in testing whether the median m takes on a particular value m0.
H0 : m = m0; Ha : m > m0 or Ha : m < m0 or Ha : m ≠ m0
☛ Consider the quantities Xi − m0 for i = 1, 2, · · · , n. If the null hypothesis is true, that is, m = m0, then we should expect about half of the Xi − m0 quantities obtained to be positive and half to be negative.
☛ This analysis of Xi − m0 under the three situations m = m0, m > m0, and m < m0 suggests that a reasonable test for testing the value of a median m should depend on Xi − m0.
The Sign Test for a Median — Steps
☛ Calculate the matched difference for each individual in the sample.
☛ Ignore pairs with difference 0.
☛ The number of trials n is the count of the remaining pairs.
☛ Record N− = the number of negative signs and N+ = the number of positive signs. If the null hypothesis is true (m = m0), then N− and N+ both follow a binomial distribution with parameters n and p = 1/2.
☛ Calculate the p-value and state the conclusion:
☞ For Ha : m > m0, reject H0 if the observed count n− is small, or equivalently, if p-value = P(N− ≤ n−) is small.
☞ For Ha : m < m0, reject H0 if n+ is small, or equivalently, if p-value = P(N+ ≤ n+) is small.
☞ For Ha : m ≠ m0, reject H0 if min(n−, n+) is small, or equivalently, if p-value = 2P(Nmin ≤ min(n−, n+)) is small.
Example — the Sign Test for a Median
☛ Question: A random sample of 20 numbers
9.4, 13.4, 15.6, 16.2, 16.4, 16.8, 18.1, 18.7, 18.9, 19.1,
19.3, 20.1, 20.4, 21.6, 21.9, 23.4, 23.5, 24.8, 24.9, 26.8
Is there sufficient evidence to conclude that the median is smaller than 22?
☛ Solution: testing the null hypothesis H0 : m = 22 against the alternative hypothesis Ha : m < 22.
☞ First calculate xi − 22.
☞ The observed number of positive signs is n+ = 5. Therefore, we need to calculate how likely it would be to observe as few as 5 positive signs if the null hypothesis were true.
☞ The p-value is P(X ≤ 5) = P(X = 5) + P(X = 4) + P(X = 3) + P(X = 2) + P(X = 1) + P(X = 0) = 0.0207 < 0.05, where X is binomial with n = 20 and p = 1/2.
☞ There is sufficient evidence, at the 0.05 level, to conclude that the median is smaller than 22.
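The sign test above reduces to a binomial tail sum, which can be checked directly. This sketch uses the 20 sample values from the example and the standard library's `math.comb`; the function name `sign_test_lower` is an illustrative choice.

```python
from math import comb

sample = [9.4, 13.4, 15.6, 16.2, 16.4, 16.8, 18.1, 18.7, 18.9, 19.1,
          19.3, 20.1, 20.4, 21.6, 21.9, 23.4, 23.5, 24.8, 24.9, 26.8]

def sign_test_lower(xs, m0):
    """Sign test of H0: m = m0 versus Ha: m < m0; returns (n, n_plus, p_value)."""
    diffs = [x - m0 for x in xs if x != m0]      # ignore zero differences
    n = len(diffs)
    n_plus = sum(d > 0 for d in diffs)
    # Under H0, N+ is Binomial(n, 1/2): P(N+ <= n_plus) = sum of C(n, k) / 2^n.
    p_value = sum(comb(n, k) for k in range(n_plus + 1)) / 2 ** n
    return n, n_plus, p_value

n, n_plus, p_value = sign_test_lower(sample, m0=22)
print(f"n = {n}, n+ = {n_plus}, p-value = {p_value:.4f}")
```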
Comparing Two Samples
☛ Independent samples: subjects in one sample are completely unrelated to subjects in the other sample.
☛ We often compare two treatments used on independent samples.
☛ Is the difference between both treatments due only to variations from the random sampling (B), or does it reflect a true difference in population means (A)?
Two-sample z statistic
☛ We have two independent SRSs (simple random samples) possibly coming from two distinct populations with (µ1, σ1) and (µ2, σ2). We use x̄1 and x̄2 to estimate the unknown µ1 and µ2.
☛ When both populations are normal, the sampling distribution of (x̄1 − x̄2) is also normal, with standard deviation:
$$ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. $$
☛ Then the two-sample z statistic has the standard normal N(0, 1) sampling distribution:
$$ z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}. $$
☛ The null hypothesis is typically that both population means µ1 and µ2 are equal, thus their difference is equal to zero.
H0 : µ1 = µ2 ⟺ µ1 − µ2 = 0
with either a one-sided or a two-sided alternative hypothesis.
Two Independent Samples t Distribution
☛ We have two independent SRSs (simple random samples) possibly coming from two distinct populations with (µ1, σ1) and (µ2, σ2) unknown. We use (x̄1, s1) and (x̄2, s2) to estimate (µ1, σ1) and (µ2, σ2), respectively.
☛ To compare the means, both populations should be normally distributed. However, in practice, it is enough that the two distributions have similar shapes and that the sample data contain no strong outliers.
☛ The two-sample t statistic follows approximately the t distribution with a standard error SE (spread) reflecting variation from both samples:
$$ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. $$
☛ Conservatively, the degrees of freedom is equal to the smaller of n1 − 1 and n2 − 1.
Two-Sample t Significance Test
☛ The null hypothesis is that both population means µ1 and µ2 are equal, thus their difference is equal to zero.
H0 : µ1 = µ2 ⟺ µ1 − µ2 = 0
with either a one-sided or a two-sided alternative hypothesis.
☛ We find how many standard errors (SE) away from (µ1 − µ2) is (x̄1 − x̄2) by standardizing with t:
$$ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE}. $$
☛ Because in a two-sample test H0 poses µ1 − µ2 = 0, we simply use
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, $$
with df = min(n1 − 1, n2 − 1).
Example: Two-Sample t Significance Test
☛ We want to know whether parental smoking decreases children’s lung capacity as measured by the forced vital capacity (FVC) test. Is the mean FVC lower in the population of children exposed to parental smoking?
Parental smoking    FVC x̄    s       n
Yes                 75.5      9.3     30
No                  88.2      15.1    30
☛ H0 : µsmoke = µno ⟺ µsmoke − µno = 0,
Ha : µsmoke < µno ⟺ µsmoke − µno < 0 (one-sided)
☛ The difference in sample averages follows approximately the t distribution:
$$ t\left(0,\ \sqrt{\frac{s_{smoke}^2}{n_{smoke}} + \frac{s_{no}^2}{n_{no}}}\right), \quad df = 29 $$
☛ We calculate the t statistic:
$$ t = \frac{\bar{x}_{smoke} - \bar{x}_{no}}{\sqrt{\frac{s_{smoke}^2}{n_{smoke}} + \frac{s_{no}^2}{n_{no}}}} = \frac{75.5 - 88.2}{\sqrt{\frac{9.3^2}{30} + \frac{15.1^2}{30}}} = -3.9 $$
☛ In Table D, for df = 29 we find |t| > 3.659 ⟹ p < 0.0005 (one-sided). It’s a very significant difference, so we reject H0; i.e., lung capacity is significantly impaired in children of smoking parents.
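The arithmetic in this example can be sketched as a small function that takes the summary statistics from the table. The function name is an assumption for illustration, and the degrees of freedom use the conservative rule min(n1 − 1, n2 − 1) from the slides.

```python
from math import sqrt

def two_sample_t(x1, s1, n1, x2, s2, n2):
    """Return (SE, t, df) for H0: mu1 = mu2 with unpooled variances."""
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    t = (x1 - x2) / se
    df = min(n1 - 1, n2 - 1)        # conservative degrees of freedom
    return se, t, df

# FVC example: children of smokers (sample 1) versus non-smokers (sample 2).
se, t, df = two_sample_t(75.5, 9.3, 30, 88.2, 15.1, 30)
print(f"SE = {se:.2f}, t = {t:.1f}, df = {df}")
```

Because |t| exceeds the Table D value 3.659 for df = 29, the one-sided p-value is below 0.0005, as stated above.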
Two-Sample t Confidence Interval
☛ Because we have two independent samples, we use the difference between both sample averages (x̄1 − x̄2) to estimate (µ1 − µ2).
☛ Practical use of t : t∗
☞ C is the area between −t∗ and t∗.
☞ We find t∗ in the line of Table D for df = min(n1 − 1, n2 − 1) and the column for confidence level C.
☞ The margin of error m is:
$$ m = t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = t^* \, SE $$
Common Mistake
☛ A common mistake is to calculate a one-sample confidence interval for µ1 and then check whether µ2 falls within that confidence interval, or vice versa.
☛ This is WRONG because the variability in the sampling distribution for two independent samples is more complex and must take into account variability coming from both samples. Hence the more complex formula for the standard error:
$$ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$
Example: Two-Sample t Confidence Interval
☛ Can directed reading activities in the classroom help improve reading ability? A class of 21 third-graders participates in these activities for 8 weeks while a control classroom of 23 third-graders follows the same curriculum without the activities. After 8 weeks, all children take a reading test (scores in table).
☛ 95% confidence interval for (µ1 − µ2), with df = 20 conservatively, t∗ = 2.086:
$$ CI: (\bar{x}_1 - \bar{x}_2) \pm m; \quad m = t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = 2.086 \times 4.31 = 8.99 $$
With 95% confidence, (µ1 − µ2) falls within 9.96 ± 8.99, or 1.0 to 18.9.
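As a quick check of the interval above, the sketch below starts from the values quoted in the example (difference 9.96, SE 4.31, t∗ = 2.086), since the individual scores are not reproduced here. The function name is hypothetical.

```python
def two_sample_t_interval(diff, se, t_star):
    """Return (margin, (low, high)) for the difference of two means."""
    margin = t_star * se
    return margin, (diff - margin, diff + margin)

# Values quoted in the directed reading example.
margin, (low, high) = two_sample_t_interval(diff=9.96, se=4.31, t_star=2.086)
print(f"m = {margin:.2f}, interval: {low:.2f} to {high:.2f}")
```

The interval lies entirely above zero, so a two-sided test of H0 : µ1 = µ2 at the 5% level would reject H0.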
Robustness
☛ The two-sample t procedures are more robust than the one-sample t procedures. They are the most robust when both sample sizes are equal and both sample distributions are similar. But even when we deviate from this, two-sample tests tend to remain quite robust.
☞ When planning a two-sample study, choose equal sample sizes if you can.
☛ As a guideline, a combined sample size (n1 + n2) of 40 or more will allow you to work with even the most skewed distributions.