statistics summarizing, visualizing and understanding data
TRANSCRIPT
![Page 1: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/1.jpg)
STATISTICS
Summarizing, Visualizing and Understanding Data
![Page 2: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/2.jpg)
I. Populations, Variables, and Data
![Page 3: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/3.jpg)
Populations and Samples
To a statistician, the population is
the set or collection under investigation. Individual members of the population are not usually of interest. Rather, investigators try to infer with some degree of confidence the general features of the population.
![Page 4: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/4.jpg)
Examples
Students currently enrolled at a certain university.
Registered voters in a certain Congressional district.
The population of large-mouthed bass in a certain lake.
The population of all decay times of a radioactive isotope.
![Page 5: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/5.jpg)
Statistical Inference
Drawing and quantifying the reliability of conclusions about a population from observations on a smaller subset of the population.
Sample: The subset observed.
![Page 6: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/6.jpg)
Variables and Data
A population variable is a descriptive number or label associated with each member of a population.
The values of a population variable are the various numbers (or labels) that occur as we consider all the members of the population.
Values of variables that have been recorded for a population or a sample from a population constitute data.
![Page 7: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/7.jpg)
Types of Data
Nominal variables are variables whose values are labels.
Ordinal variables are variables whose values have a natural order.
Interval variables have values represented by numbers referring to a scale of measurement.
Ratio variables have values that are positive numbers on a scale with a unit of measurement and a natural zero point.
![Page 8: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/8.jpg)
Guess the Type
Age Questionnaire responses: 1=”strongly
agree”,2=”agree”…,5=”strongly disagree”
Letter grades Reading comprehension scores Gender Zip codes Molecular velocities
![Page 9: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/9.jpg)
II. Summarizing Data
![Page 10: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/10.jpg)
Location Measures (Measures of Central Tendency)
A location measure or measure of central tendency for a variable is a single value or number that is taken as representing all the values of the variable. Different location measures are appropriate for different types of data.
![Page 11: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/11.jpg)
The Mean
For interval or ratio variables x N individuals in the sample or population xi = value of x for ith individual
)(1
21 NxxxN
x
The mean of a population variable is denotedby (the Greek letter mu).
![Page 12: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/12.jpg)
The Mean with Repeated Values
Distinct values of x: nj = frequency of occurrence of
Mxxx ,,, 21 jx
)(1
2211 MM xnxnxnN
x
![Page 13: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/13.jpg)
The Mean with Repeated Values
Relative frequencies: N
nf jj
MM xfxfxfx 2211
![Page 14: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/14.jpg)
Example
-2 1 3 4 6
2 1 3 5 3
jx
jn
![Page 15: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/15.jpg)
The Median
Informally, the “middle” value when all the values are arranged in order
A number m is a median of x if at least half the individuals i in the population have
and at least half of them havemxi
mxi
![Page 16: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/16.jpg)
The Median – Example 1
x: –2.0, 1.5, 2.2, 3.1, 5.7 (no repetitions)
median(x)=2.2
![Page 17: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/17.jpg)
The Median – Example 2
x: -2.0, 1.5, 3.1, 3.1, 3.1
median(x) = 3.1
![Page 18: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/18.jpg)
The Median – Example 3
x: -2.0, 1.5, 3.1, 5.7, 5.9, 7.1 median(x)=Any number in [3.1,5.7] By convention, for an even number
of individuals choose the midpoint between the smallest and largest medians, e.g.,
.4.42
7.51.3
m
![Page 19: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/19.jpg)
Example
Change 7.1 to 71. What happens to the mean and the median?
The mean changes from 3.55 to 14.2
No change in the median The median is much less sensitive
to outliers (which may be mistakes in recording data)
![Page 20: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/20.jpg)
A A- B+ B B- C+ C C- D+ D D- F
8 5 10 18 18 15 14 6 4 1 1 0
The Median for Ordered Categories
N=100. The median grade is B-.
![Page 21: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/21.jpg)
The Mode
The data value with the greatest frequency
Not useful for interval or ordinal data if recorded with precision
The only useful location measure for strictly nominal data
![Page 22: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/22.jpg)
A A- B+ B B- C+ C C- D+ D D- F
8 5 10 18 18 15 14 6 4 1 1 0
Example
The modes are B and B-.
![Page 23: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/23.jpg)
Cumulative Frequencies and Percentiles
x is an interval or ratio variable. Ordered distinct values:
Relative frequencies:
Mxxx 21
Mfff ,,, 21
![Page 24: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/24.jpg)
Cumulative Frequencies and Percentiles
Cumulative Frequencies
Cumulative Relative Frequencies
MM nnnN
nnnN
nnN
nN
21
3213
212
11
MM fffF
fffF
ffF
fF
21
3213
212
11
![Page 25: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/25.jpg)
The Weather Person’s Prediction Errors x
x'j -2 1 3 4 6
nj 2 1 3 5 3
Nj 2 3 6 11 14
fj .1429 .0714 .2143 .3571 .2143
Fj .1429 .2143 .4286 .7857 1.000
![Page 26: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/26.jpg)
Exercise
From the table above, what fraction of the data is less than 1? What fraction is greater than 3? What fraction is greater than or equal to 3?
![Page 27: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/27.jpg)
Percentiles
x: an interval or ratio variable A number a is a pth percentile of x if at
least p% of the values of x are less than or equal to a and at least (100-p) % of the values of x are greater than or equal to a.
The 25th percentile is called the first quartile of x and the 75th percentile is the third quartile of x.
The 50th percentile is the second quartile or median.
![Page 28: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/28.jpg)
Example
For the weather person’s errors, the 25th percentile is 3. The 50th percentile and third quartile are both 4.
![Page 29: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/29.jpg)
Measures of Variability
Statisticians are not only interested in describing the values of a variable by a single measure of location. They also want to describe how much the values of the variable are dispersed about that location.
![Page 30: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/30.jpg)
Population Variance and Standard Deviation
x: an interval or ratio variable. N=number of individuals in
population. Variance of x:
Standard deviation of x:
N
xxx N22
22
12 )()()(
2
![Page 31: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/31.jpg)
Sample Variance and Standard Deviation
n: the number of individuals in a sample from a population
Sample variance:
Sample standard deviation:
1
)()()( 222
212
n
xxxxxxs n
2ss
![Page 32: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/32.jpg)
Alternative Formulas for the Variance
Using frequencies:
Using relative frequencies:
N
xnxnxn MM22
222
112 )()()(
2222
211
2 )()()( MM xfxfxf
![Page 33: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/33.jpg)
The Interquartile Range
Q1, Q3 : 1st and 3rd quartiles, respectively
Interquartile range:
Not influenced by a few extremely large or small observations (outliers)
13 QQIQR
![Page 34: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/34.jpg)
The Range
The difference between the largest data value and the smallest
Range of sample values is not a reliable indicator of the range of a population variable
![Page 35: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/35.jpg)
III. Graphical Methods
![Page 36: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/36.jpg)
Pie Charts (Circle Graphs)
Sources: AT&T (1961) The World’s TelephonesR: A language and environment for statistical computing, the R core
development team.
![Page 37: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/37.jpg)
Bar Charts (Bar Graphs)
![Page 38: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/38.jpg)
Pros and Cons
Bar chart has a scale of measurement – more precise information
Pie chart gives more vivid impression of relative proportions, e.g., obvious at a glance that N. America had more than half the telephones in the world.
![Page 39: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/39.jpg)
Stemplots (Stem and Leaf Diagrams)
Stem|Leaves Cumulative Frequency 4 | 7 1 5 | 448889 7 6 | 34789 12 7 | 012234455666888889999 33 8 | 0022234457799 46 9 | 0457 50
Grades of 50 students on a test
![Page 40: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/40.jpg)
Find the Median
Stem|Leaves Cumulative Frequency 4 | 7 1 5 | 448889 7 6 | 34789 12 7 | 012234455666888889999 33 8 | 0022234457799 46 9 | 0457 50
25th and 26th leaves circled. Median = 78
![Page 41: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/41.jpg)
Exercise
Stem|Leaves Cumulative Frequency 4 | 7 1 5 | 448889 7 6 | 34789 12 7 | 012234455666888889999 33 8 | 0022234457799 46 9 | 0457 50
The 1st quartile is 70 and the 3rd quartile is 82.
![Page 42: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/42.jpg)
Boxplots (Box and Whisker Diagrams)
![Page 43: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/43.jpg)
Elements of a Boxplot
box
whisker
quartiles median
outlierlargest
![Page 44: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/44.jpg)
Boxplot Shows Distribution Skewed to the Left
![Page 45: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/45.jpg)
Histograms
For interval or ratio data
Data is grouped into class intervals
Superficially like a bar chart
![Page 46: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/46.jpg)
Frequency Histogram
Source: R: A language and environment for statistical computing, the R core development team.
Class interval (bin)
Height=bin frequency
![Page 47: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/47.jpg)
Probability Histogram
Area of bar = relative bin frequencyE.g., .011×25=.275
![Page 48: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/48.jpg)
Ogives(Cumulative Frequency Polygons)
Related to probability histograms
Examples of cumulative distribution functions
Probability histograms are examples of density functions
![Page 49: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/49.jpg)
Example Ogive
![Page 50: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/50.jpg)
Relationship Between Probability Histogram and Ogive
The height of the ogive is the cumulative area under the histogram
![Page 51: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/51.jpg)
Estimating Percentiles from Ogives
Horizontal line has height .75
Vertical line intersects horizontal axis at 60
Estimated 3rd quartile is 60
True 3rd quartile is 62
![Page 52: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/52.jpg)
Scatterplots (Scatter Diagrams)
Used for jointly observed interval or ratio variables
Example: Heights and weights of individuals
Example: State per capita spending on secondary education and state crime rate
Example: Wind speed and ozone concentration
![Page 53: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/53.jpg)
Example Scatterplot
centroid
![Page 54: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/54.jpg)
Fitting a Line
Relationship between variables x and y is approximately linear.
Approximately, y = a + bx. Find a and b so that data comes
closest to satisfying the equation. Least squares – a formal
mathematical technique to be shown later.
![Page 55: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/55.jpg)
Line Fitted by Least Squares
![Page 56: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/56.jpg)
IV. Sampling
![Page 57: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/57.jpg)
Why Sample?
Because the population is too large to observe all its members.
The population may be partly inaccessible.
The population may even be hypothetical.
![Page 58: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/58.jpg)
Statistical Inference
Drawing conclusions about the population based on observations of a sample.
Reliability of inferences must be quantifiable.
Random sampling allows probability statements about the accuracy of inferences.
![Page 59: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/59.jpg)
Sampling With Replacement
Population has N members. n population members chosen
sequentially. Once chosen, a member of the population
may be chosen again. At each stage, all members of the
population are equally likely to be chosen. Random experiment with possible
equally likely outcomes.
nN
![Page 60: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/60.jpg)
Sampling With Replacement (continued)
x is a population variable.
X1 = value of x for 1st sampled individual, X2 = value of x for 2nd sampled individual, etc.
Each Xi is a random variable. The random variables are independent.
The sequence is a random sample of values of x, or a random sample from the distribution of x.
nXXX ,,, 21
nXXX ,,, 21
![Page 61: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/61.jpg)
Sampling Without Replacement
Population has N individuals. n members chosen sequentially. Once chosen, an individual may not
be chosen again. At each stage, all of the remaining
members are equally likely to be chosen next.
Random experiment with possible equally likely outcomes.
)1()1( nNNN
![Page 62: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/62.jpg)
Sampling Without Replacement (continued)
Sample without replacement. Ignore the order of the sequence of
individuals in the sample. Random experiment whose
outcomes are subsets of size n. Experiment has possible
equally likely outcomes. Common meaning of “random
sample of size n”
)!(!
!, nNn
NC nN
![Page 63: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/63.jpg)
Random Number Generators
Calculators and spreadsheet programs can generate pseudorandom sequences.
Press the random number key of your calculator several times.
Simulates a random sample with replacement from the set of numbers between 0 and 1 (to high precision).
![Page 64: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/64.jpg)
Generating a Sample with Replacement
Number the individuals from 1 to N. Generate a pseudorandom number
R. Include individual i in the sample if
Repeat n times. Individuals may be included more than once.
iNRi 1
![Page 65: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/65.jpg)
Exercise
Suppose you have 30 students in your class. Use the procedure just described to obtain a sample of size 10 (a) with replacement, (b) without replacement.
![Page 66: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/66.jpg)
V. Estimation
![Page 67: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/67.jpg)
The Sample Mean and Standard Deviation
is a random sample from the distribution of a population variable x.
The sample mean is
The sample variance is
nXXX ,,, 21
)(1
21 nXXXn
X
])()()[(1
1 222
21
2 XXXXXXn
S n
![Page 68: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/68.jpg)
The Sample Mean and Standard Deviation (continued)
The sample standard deviation is
The sample mean, variance and standard deviation are all random variables because they depend on the outcome of the random sampling experiment.
2SS
![Page 69: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/69.jpg)
Estimators
The sample mean, variance, and standard deviation have distributions derived from the distribution of values of the population variable x.
They are estimators of the population mean , the population variance 2, and the population standard deviation of x.
![Page 70: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/70.jpg)
Unbiased Estimators
The theoretical expected values of the sample mean and sample variance are equal to their population counterparts, i.e.,
and S2 are said to be unbiased estimators of and 2, respectively
S is biased.
)(XE 22 )( SEand
X
)(SE
![Page 71: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/71.jpg)
The Distribution of the Random Variable
The mean of is , the same as the mean of the population variable x.
The standard deviation of is
These are the theoretical mean and standard deviation.
X
X
X ./ n
![Page 72: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/72.jpg)
Density Functions
A density function is a nonnegative function such that the total area between the graph of the function and the horizontal axis is 1.
A probability histogram is a density function.
Other density functions are limits of histograms as the number of data elements grows without bound.
![Page 73: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/73.jpg)
The Standard Normal Density Function
![Page 74: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/74.jpg)
Percentiles of the Standard Normal Distribution
za is the 100(1-) percentile of the distribution
![Page 75: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/75.jpg)
Symmetry About the Vertical Axis
![Page 76: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/76.jpg)
Probabilities Related to the Standard Normal Distribution
![Page 77: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/77.jpg)
Other Normal Distributions
Let Z be a random variable with the standard normal distribution.
The mean of Z is 0 and the standard deviation of Z is 1.
Let and be any numbers, >0. Let Y =Z+ Y has the normal distribution with
mean and standard deviation .
![Page 78: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/78.jpg)
Other Normal Distributions Example
= 1 and = 1.5
![Page 79: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/79.jpg)
Standardizing: The Inverse Operation
Let Y be normally distributed with mean and standard deviation .
Let . This is the z-score of Y.
Then Z has the standard normal distribution and
Y
Z
][][
bZ
aPbYaP
![Page 80: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/80.jpg)
The Central Limit Theorem
Let be the sample average of a random sample of n values of a population variable x.
The population variable x has mean and standard deviation .
Standardize by subtracting its mean and dividing by its standard deviation
X
X
)(
/
Xn
n
XZ
![Page 81: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/81.jpg)
The Central Limit Theorem (continued)
Get Ready for the Central Limit Theorem!
![Page 82: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/82.jpg)
The Central Limit Theorem(continued)
The Central Limit Theorem: As the sample size n grows without
bound, the distribution of Z approaches the standard normal distribution. This is true no matter what the distribution of values of the population variable x.
![Page 83: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/83.jpg)
Another Statement of the CLT
For sufficiently large sample sizes n and for all numbers a and b,
In almost all applications, n≥50 is large enough.
])()(
[][
bnZ
anPbXaP
![Page 84: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/84.jpg)
The CLT in Action
Sample n=30 from the population variable COUNTS whose distribution is tabulated. Calculate the sample average. Repeat this 500 times and construct a histogram of the z-scores of the 500 sample averages. Note: The distribution of COUNTS is very far from normal.
xj 0 1 2 3 4 5 6
fj .36 .33 .19 .08 .02 .01 .01
![Page 85: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/85.jpg)
Distribution of COUNTS
![Page 86: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/86.jpg)
Result-500 Averages of 30 Samples from COUNTS
![Page 87: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/87.jpg)
Estimating a Population Mean
The sample mean is an unbiased estimator of the population mean .
For “large” sample sizes n, has approximately a normal distribution with mean and standard deviation
For large n, the sample mean is an accurate estimator of the population mean with high probability.
X
X
n/
![Page 88: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/88.jpg)
Example
Suppose and we want to estimate with an error no greater than 0.05.
Assume is exactly normally distributed. Standardize.
X
X
]025.0|[|]05.0|[| nZPXP
![Page 89: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/89.jpg)
Probabilities of 1-place Accuracy = 2
![Page 90: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/90.jpg)
Confidence Intervals for the Population Mean – Review of 2/z
![Page 91: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/91.jpg)
100(1-)% Confidence Interval
By the CLT
Rearranging the inequalities
][1 2/2/n
zXn
zXP
]/
[1 2/2/ zn
XzP
![Page 92: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/92.jpg)
A Difficulty
is probably unknown, so the confidence interval
can’t be used. What to do?
nzX
2/
![Page 93: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/93.jpg)
Enhanced Central Limit Theorem
Define the modified z-score for as
As n grows without bound, the distribution of Z approaches the standard normal distribution.
X
S
Xn
nS
XZ
)(
/
![Page 94: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/94.jpg)
A More Useful Confidence Interval
By the enhanced CLT
An approximate 100(1-)% confidence interval is
][1 2/2/n
SzX
n
SzXP
n
SzX 2/
![Page 95: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/95.jpg)
Example
n=50 from COUNTS ( = 1.14) = 1.32 S = 1.39 1- = .95 = =1.32±0.39
95% confidence interval: (0.93, 1.71) Don’t say .95=P[0.93<<1.71]
X
n
SzX 2/
50
39.196.132.1
![Page 96: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/96.jpg)
Confidence Intervals for Proportions
x is a population variable with only two values, 0 and 1.
Numerical code for two mutually exclusive categories, e.g., “male” and “female”, or “approves” and “disapproves”.
p=relative frequency of x=1. =p; 2=p(1-p)
![Page 97: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/97.jpg)
Confidence Intervals for Proportions (continued)
Sample n values of x, with replacement. Result is a sequence of 1s and 0s.
Sample mean is the relative frequency in the sample of 1s, e.g., the relative frequency of females in the sample of individuals.
Denote the sample mean by since it is an estimator of p.
p̂
![Page 98: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/98.jpg)
Confidence Intervals for Proportions(continued)
By the enhanced CLT, is
approximately standard normal.
An approximate 100(1-
confidence interval is
)ˆ1(ˆ
)ˆ(
pp
ppnZ
n
ppzp
)ˆ1(ˆˆ 2/
![Page 99: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/99.jpg)
Example
A public opinion research organization polled 1000 randomly selected state residents. Of these, 413 said they would vote for a 1¢ sales tax increase dedicated to funding higher education. Find a 90% confidence interval for the proportion of all voters who would vote for such a proposal.
![Page 100: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/100.jpg)
Solution
n = 1000
1-=.90;
= 0.413 ± 1.645
(0.387, 0.489)
413.01000
413ˆ p
645.105.2
zz
n
ppzp
)ˆ1(ˆˆ 2/
1000
587.0413.0
![Page 101: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/101.jpg)
Linear Regression and Correlation
x and y are jointly observed numeric variables, i.e., defined for the same population or arising from the same experiment.
Have observations for n individuals or outcomes.
Data: )11 ,(,),,( nn yxyx
![Page 102: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/102.jpg)
Examples
(An observational study) Let x be the height and y the weight of individuals from a human population.
(A designed experiment) Let x be the amount of fertilizer applied to a plot of cotton seedlings and let y be the weight of raw cotton harvested at maturity.
![Page 103: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/103.jpg)
Data on Fertilizer and Cotton Yield
x 2 2 2 4 4 4 6 6 6 8 8 8
y 2.3 2.2 2.2 2.5 2.9 2.7 3.4 2.7 3.4 3.5 3.4 3.3
![Page 104: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/104.jpg)
Scatterplot of Fertilizer vs. Yield
![Page 105: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/105.jpg)
Assumptions of Linear Regression
There is a population or distribution of values of y
for any particular value of x.
There are unknown constants a and b so that for
any particular value of x, the mean of all the
corresponding values of y is
The standard deviation of the values of y
corresponding to a value of x is the same for all
values of x.
bxay
![Page 106: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/106.jpg)
The Method of Least Squares
Estimate a and b by choosing them to minimize the sum of squared differences between the observed values yi and their putative expected values
In symbols, minimize ibxa
2222
211 )()()( nn bxaybxaybxay
![Page 107: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/107.jpg)
The Least Squares Estimates
Let and be the means of the observed x’s
and y’s. Let be the sample variance of the x’s.
The covariance between the x’s and the y’s is
The least squares estimate of the slope is
The least squares estimate of the intercept is
x y
)])(())([(1
111 yyxxyyxx
ns nnxy
2x
xy
s
sb
2xs
xbya
![Page 108: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/108.jpg)
Least Squares Line for Cotton Yield
![Page 109: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/109.jpg)
Correlation
The correlation between the x’s and y’s is
r is related to the slope b of the least squares regression line by
r is always between -1 and 1. r measures how nearly linear the relationship between x and y is. If r = 0, then x and y are uncorrelated.
yx
xy
ss
sr
y
x
s
sbr
![Page 110: STATISTICS Summarizing, Visualizing and Understanding Data](https://reader035.vdocument.in/reader035/viewer/2022062314/56649e205503460f94b0c3bc/html5/thumbnails/110.jpg)
Examples