what is statistics? - math.wsu.edu · stat360 (wang) ch1 3 ch1 4 a sample is a collection of...
TRANSCRIPT
CH1: Overview and Descriptive Statistics
CH1 1STAT360 (Wang)
What is Statistics?
Statistics covers the collection, analysis and interpretation of data:
• Design experiments to collect data.
• Extract information from data.
• Make decisions and predictions in the presence of uncertainty and variation.
• The population is the entire collection of units in which we are interested.
• Example: All babies born in the United States.
A sample is a collection of persons or things on which we measure one or more variables.
Example: 150 babies born in a certain hospital
The number of observations in a sample is called the sample size and is denoted by the letter n.
Example: The birth weights of 150 babies born in a certain hospital. (n=150)
[Diagram: Target Population, Sampled Population, Sample]
Target Population is the population you are interested in.
Sampled Population is the population you actually sample.
Warning: You can make inferences only about the population from which you sampled.
Random Sampling
• A simple random sample is a sample where all the members of the population have an equal chance of selection and the selection of one member is independent of the selection of another.
• Random sampling allows us to make inference back to a population.
• We can use R, Excel or other software packages to generate random numbers or randomly sample a column of values.
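A minimal sketch of drawing a simple random sample in software, written in Python rather than R or Excel. The population of ID numbers 1 to 1000 and the seed are assumptions made only for illustration; the sample size n = 150 echoes the hospital example.

```python
import random

# Hypothetical sampling frame: ID numbers 1..1000 stand in for the population.
population_ids = list(range(1, 1001))

# Fixed (arbitrary) seed so the draw can be reproduced.
rng = random.Random(360)

# random.sample draws without replacement, so no unit is chosen twice
# and every unit has the same chance of ending up in the sample.
n = 150
sample_ids = rng.sample(population_ids, n)

print(len(sample_ids))          # 150
print(sorted(sample_ids)[:5])   # the five smallest sampled IDs
```

The same call works on a column of values instead of ID numbers.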
• An experimental unit is an individual item in a population that data is being collected from.
• A variable of interest is information of interest about each individual item in a population.
A variable is information of interest about each individual item in a population. Examples: Height, Weight, Age, Gender
Categorical/Qualitative Variables are those we can place into categories. Examples: Eye Color, Gender
Numerical/Quantitative Variables are those for which we can record a numerical value and then order respondents according to those values. Examples: Age, Time
Discrete Variables have values that can be obtained by counting. Examples: Number of Children, Age (in years)
Continuous Variables can take any value within a given interval. Examples: Height, Weight
[Diagram: VARIABLES divide into CATEGORICAL and NUMERICAL; NUMERICAL variables divide into DISCRETE and CONTINUOUS]
• Population parameter: a numeric value describing a property of a population such as mean, median and variance.
• Sample statistics: also referred to simply as "statistics", these are numeric values describing, or summarizing, a data set. Sample statistics are also referred to as estimators because one of the purposes of statistics is to estimate population parameters.
Descriptive Statistics is the group of techniques for summarizing and describing important features of the data.
Inferential Statistics is the group of techniques for generalizing from a sample to a population.
Example
The Denver Post reported on a study examining the effect of diet on the life span of fruit flies (Low-cal diet extends fruit-fly life at any age, 9/19/03).
The study considered 7,492 fruit flies. Half were randomly assigned to a low-cal diet and the other half received a regular diet. The number of days lived was recorded for each fruit fly.
What is the sample and sample size?
The explanatory variable is whether or not a fly received the low-cal diet. Is this a categorical or numerical variable?
The outcome variable is the number of days lived. Is this a categorical or numerical variable?
Descriptive Statistics: Measures of Center
The mean is simply the average. To find the mean we add up (sum) all the values and divide by the number of values:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum x_i}{n} \]
For the radish data the sample mean is 22.14.
A physical interpretation of \(\bar{x}\) demonstrates how it measures the location (center) of a sample. Think of drawing and scaling a horizontal measurement axis, and then represent each sample observation by a 1-lb weight placed at the corresponding point on the axis.
The only point at which a fulcrum can be placed to balance the system of weights is the point corresponding to the value of \(\bar{x}\) (see Figure 1.14).
Figure 1.14 The mean as the balance point for a system of weights
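The balance-point idea can be checked numerically: the deviations from the mean sum to zero. The sketch below uses the radish lengths from these slides; the variable names are illustrative only.

```python
# Radish lengths (mm) from the radish growth example.
radish = [15, 20, 11, 30, 33, 20, 29, 35, 8, 10, 22, 37, 15, 25]

n = len(radish)
x_bar = sum(radish) / n            # sample mean: sum of values / number of values

# Signed distance of each observation from the mean.
deviations = [x - x_bar for x in radish]

print(round(x_bar, 2))                 # 22.14
print(abs(sum(deviations)) < 1e-9)     # True: the weights balance at x_bar
```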
The median is the value that most nearly lies in the middle of the sample.
From a list of ordered observations, the median is the middle value (if n is odd) or the average of the two middle values (if n is even).
Sorted Radish Data: 8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37
For the radish data, the median is
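One way to check the answer is to apply the odd/even rule directly. This is a small sketch; the helper name `median_of_sorted` is made up for illustration, and Python's built-in `statistics.median` is used as a cross-check.

```python
import statistics

radish_sorted = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]

def median_of_sorted(ordered):
    """Middle value if n is odd, average of the two middle values if n is even."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of_sorted(radish_sorted))    # 21.0, the average of 20 and 22
print(statistics.median(radish_sorted))   # 21.0
```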
Example: A prospective employee at a laboratory is told that the mean hourly wage is $16/hour. The actual wages of employees at the lab are: $10, $10, $10, $10, $40. Since only the manager earns more than $10/hour, the median is a better measure of center in this case!
The mean is more affected by outliers in the sample. Another way to say this is that the median is more resistant.
If the total is of interest, then the mean is probably the appropriate summary value. A rancher would be interested in mean cattle weight, because total weight translates to profit.
In many other cases, the median does a better job of representing the center.
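The wage example can be reproduced in a few lines. This sketch only recomputes the two summaries for the five wages listed above.

```python
import statistics

# Hourly wages (dollars) from the laboratory example.
wages = [10, 10, 10, 10, 40]

mean_wage = statistics.mean(wages)      # pulled upward by the single $40 wage
median_wage = statistics.median(wages)  # unaffected by how large the top wage is

print(mean_wage)    # 16
print(median_wage)  # 10
```

Replacing the $40 wage by $400 would change the mean but leave the median at $10, which is what "resistant" means here.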
• The trimmed mean is a compromise between the mean and the median.
• A 10% trimmed mean is calculated by eliminating the smallest 10% and largest 10% of the sample and then averaging over what is left.
Graphical Displays
• Pie Chart (Categorical)
• Bar Graph (Categorical)
• Stem and Leaf Plot (Numerical)
• Histogram (Numerical)
• Box Plot (Numerical)
Example: Student Majors
I recorded information about the majors of students in one class.

Major       Frequency   Relative Frequency
C&B Eng         6            0.082
Eng Sc          3            0.041
Math           13            0.178
Mech Eng       37            0.507
Other          11            0.151
Physics         3            0.041
Total          73            1.000
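Each relative frequency is the frequency divided by the total count. The following sketch recomputes that column from the frequencies in the majors table; the dictionary layout is just one convenient choice.

```python
# Frequencies from the majors table.
majors = {
    "C&B Eng": 6,
    "Eng Sc": 3,
    "Math": 13,
    "Mech Eng": 37,
    "Other": 11,
    "Physics": 3,
}

total = sum(majors.values())   # 73 students in the class

# Relative frequency = frequency / total.
relative = {major: count / total for major, count in majors.items()}

for major, count in majors.items():
    print(f"{major:<10}{count:>4}{relative[major]:>8.3f}")
print(f"{'Total':<10}{total:>4}{sum(relative.values()):>8.3f}")
```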
Pie Charts
Pie charts are useful when only one categorical variable is observed.
[Figure: pie chart titled "STAT315: Majors" with slices CBE, Engineering Science, Mathematics, Mechanical Engineering, Other, Physics]
Bar Charts
Bar Charts also show percentages or frequencies in various categories, but they can be used to represent multiple categorical variables simultaneously.
[Figure: bar chart of Frequency (axis 0 to 35) by Major: CBE, Engineering Science, Mathematics, Mechanical Engineering, Other, Physics]
Stem and Leaf Plots
• Stem and leaf plots are used to represent a single numerical variable.
• They are typically used when the sample size is small.
• Procedure: 1. Sort the data. 2. Create the stems. 3. Add the leaves.
Example: Radish Growth
This data represents the length (in mm) of radishes grown in total darkness for three days.
15 20 11 30 33
20 29 35 8 10
22 37 15 25
1. Order the data:
2. Choose the stems: Use the 10's digit.
3. Add the leaves.

Stem | Leaves
0 |
1 |
2 |
3 |
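The three-step procedure (sort, create stems, add leaves) can be sketched as code. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole-number data with the 10's digit as the stem, as in the radish example.

```python
from collections import defaultdict

radish = [15, 20, 11, 30, 33, 20, 29, 35, 8, 10, 22, 37, 15, 25]

def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot (stem = 10's digit, leaf = 1's digit)."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):                 # step 1: sort the data
        leaves_by_stem[v // 10].append(v % 10)
    rows = []
    # step 2: one row per stem, including stems that have no leaves
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        # step 3: add the leaves in increasing order
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows

for row in stem_and_leaf(radish):
    print(row)
# 0 | 8
# 1 | 0 1 5 5
# 2 | 0 0 2 5 9
# 3 | 0 3 5 7
```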
Histograms
1. Divide the data up into a Frequency/Relative Frequency table:
- Divide the data into groups of equal width.
- List the frequency and relative frequency for each group. Frequency = total number in that group. Relative Frequency = fraction of the total in that group.
2. Draw a bar corresponding to each row in the table, with height corresponding to frequency or relative frequency.
Sorted Radish Data:
8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37

Frequency and Relative Frequency Table:
Length     Frequency   Relative Frequency
< 10
10 - 19
20 - 29
30 - 39

Frequency Histogram:
[Figure: frequency histogram of Radish$Length; horizontal axis 0 to 40, vertical axis frequency 0 to 5]
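The frequency table behind the histogram can be filled in by counting how many lengths fall in each group. This is a sketch assuming groups of width 10 starting at 0, which matches the four rows of the table; the variable names are illustrative.

```python
radish = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]

bin_width = 10
labels = ["< 10", "10 - 19", "20 - 29", "30 - 39"]

# Integer division by the group width gives the row index for each length.
counts = [0] * len(labels)
for length in radish:
    counts[length // bin_width] += 1

n = len(radish)
print(f"{'Length':<10}{'Freq':>5}{'Rel. Freq':>11}")
for label, count in zip(labels, counts):
    print(f"{label:<10}{count:>5}{count / n:>11.3f}")
```

The bar heights in the frequency histogram are the values in `counts`.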
Frequency Distributions: Shapes and Examples
When investigators speak about the "shape" of the data, they are referring to the shape of the histogram resulting from the data.
Symmetric Data Sets: A data set for which the histogram is (approximately) symmetric.
[Figure: Symmetric Histogram]
A data set is referred to as unimodal if there is a single prominent peak in the histogram. An example is the Symmetric histogram (previous slide).
A data set is referred to as bimodal if there are two prominent peaks in the histogram.
[Figure: Bimodal Histogram]
A Skewed Data Set is one that is basically unimodal but substantially asymmetric, with one tail longer than the other.
[Figure: Skewed Left Histogram]
[Figure: Skewed Right Histogram]
NOTE: The direction of the skew is in the direction of the long tail.
Figure 1.11 shows “smoothed” histograms, obtained by superimposing a smooth curve on the rectangles, that illustrate the various possibilities.
(a) symmetric unimodal; (b) bimodal; (c) positively skewed; (d) negatively skewed
The Five Number Summary and Boxplots
The Five Number Summary:
- minimum (smallest value)
- Q1 (first quartile)
- median
- Q3 (third quartile)
- maximum (largest value)
Note that the terms (quartiles, interquartile range) and notation (Q1, Q3) used here are different from the book's notation (fourths and fs).
Upper and Lower Quartiles (Q1 and Q3)
• Divide the data into two equal high and low groups at the median.
• The median is included in both the upper and lower groups when n is odd.
• Find the median of the low group. This is called the lower quartile or Q1.
• The median of the high group is the upper (third) quartile or Q3.
Example: Radish Data
SORT THE DATA: 8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37
Min =
Q1 =
Median=
Q3 =
Max =
NOTE: Some authors and computer packages define the quartiles differently. If n is large there is little practical difference between the definitions. When n is small the difference is noticeable.
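A sketch of the quartile rule used in these slides (split at the median, include the median in both halves when n is odd) is given below. The function names are hypothetical, and, as the note says, statistical packages may use other quartile definitions and return slightly different values.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the split-at-the-median rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 0:
        low, high = ordered[:half], ordered[half:]
    else:
        # n odd: the median belongs to both the low and the high group
        low, high = ordered[:half + 1], ordered[half:]
    return (ordered[0], median_of_sorted(low), median_of_sorted(ordered),
            median_of_sorted(high), ordered[-1])

radish = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]
print(five_number_summary(radish))   # (8, 15, 21.0, 30, 37)
```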
Boxplots
A boxplot features the values from the five number summary (plus some additional information).
[Figure: boxplot of radish Length (axis 10 to 35), labeled Min = 8, Q1 = 15, Median = 21, Q3 = 30, Max = 37]
Creating a Boxplot
1. Draw a box with ends at the lower and upper quartiles.
2. Draw a line in the box at the median.
3. Compute the width of the box; this is the interquartile range: IQR = Q3 - Q1
4. Draw the whiskers at each end with length equal to 1.5 times the interquartile range (IQR); if the minimum or maximum occurs before the full length is used, stop there.
5. Use an asterisk to indicate any additional data points beyond the range covered by the box and whiskers.
Example: 11 male college students were asked how many hours per week they exercise. The (sorted) responses were:
0, 1, 2, 2, 3, 4, 4.5, 5, 6, 8, 15
The five number summary is: Min= 0, Q1=2, Median=4, Q3=5.5, Max=15
[Figure: axis for Hours of Exercise (0 to 15), marked with min=0, Q1=2, M=4, Q3=5.5, max=15]
1. and 2. Draw a box with lines at Q1, median and Q3.
3. Calculate the IQR = Q3 - Q1 =
4. Whiskers: 1.5 x IQR =
Lower whisker to Q1 - 1.5*IQR or min =
Upper whisker to Q3 + 1.5*IQR or max =
5. Add asterisks for outliers.
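The whisker and outlier arithmetic for the exercise data can be sketched as follows, using the quartiles from the five number summary above (Q1 = 2, Q3 = 5.5). The whisker rule coded here is the one stated on the "Creating a Boxplot" slide; the variable names are illustrative.

```python
hours = [0, 1, 2, 2, 3, 4, 4.5, 5, 6, 8, 15]
q1, q3 = 2, 5.5                      # from the five number summary

iqr = q3 - q1                        # width of the box
reach = 1.5 * iqr                    # maximum whisker length

lower_limit = q1 - reach
upper_limit = q3 + reach

# Stop a whisker early if the minimum or maximum occurs before the full length.
lower_whisker_end = max(min(hours), lower_limit)
upper_whisker_end = min(max(hours), upper_limit)

# Points beyond the whiskers are marked with asterisks.
outliers = [h for h in hours if h < lower_limit or h > upper_limit]

print(iqr, reach)                            # 3.5 5.25
print(lower_whisker_end, upper_whisker_end)  # 0 10.75
print(outliers)                              # [15]
```

Many software packages instead end each whisker at the most extreme observation inside the limits (8 hours on the upper side here), so a computer-drawn boxplot may look slightly different.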
Here we use side-by-side boxplots to compare the distribution of length for radishes grown in three days of darkness versus those grown in three days of light.
Measures of Variability
The mean and median for the distributions shown here are nearly the same. However, the spread is greater for Histogram #2.
How do we summarize the variability?
Descriptive Statistics: Measures of Spread
Two simple measures of spread are:
• Range = Max - Min
• Interquartile Range (IQR) = Q3 - Q1
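As a quick illustration, both measures can be computed for the radish data. The quartiles Q1 = 15 and Q3 = 30 are taken from the radish boxplot slide rather than recomputed here.

```python
radish = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]
q1, q3 = 15, 30                          # quartiles from the radish boxplot

data_range = max(radish) - min(radish)   # Range = Max - Min
iqr = q3 - q1                            # IQR = Q3 - Q1

print(data_range)   # 29
print(iqr)          # 15
```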
Variance and Standard Deviation
• Variance and standard deviation are both measures of spread.
• The sample variance is computed as:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1} \]
• The sample standard deviation is computed as:
\[ s = \sqrt{s^2} \]
• The standard deviation is like the “average” distance to the mean.
• As the data spreads out, the variance and standard deviation increase.
• As the data becomes more concentrated, the variance and standard deviation decrease.
We will compute the variance and standard deviation for a very small sample (n=6):
2, 3, 5, 6, 7, 7
The mean of these values is \(\bar{x} = 5\).

x     x - x̄     (x - x̄)^2
2      -3          9
3      -2          4
5       0          0
6       1          1
7       2          4
7       2          4

\[ \sum (x_i - \bar{x})^2 = 22 \]
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{22}{5} = 4.4 \]
\[ s = \sqrt{s^2} = \sqrt{4.4} = 2.10 \]
So, the sample variance is 4.4 and the sample standard deviation is 2.10.
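The hand calculation can be reproduced step by step. The sketch below follows the same columns as the table (deviation, squared deviation) and then cross-checks against Python's `statistics` module, which also divides by n - 1.

```python
import math
import statistics

sample = [2, 3, 5, 6, 7, 7]
n = len(sample)

x_bar = sum(sample) / n                          # mean
squared_devs = [(x - x_bar) ** 2 for x in sample]
s_xx = sum(squared_devs)                         # sum of squared deviations
variance = s_xx / (n - 1)                        # sample variance s^2
std_dev = math.sqrt(variance)                    # sample standard deviation s

print(x_bar, s_xx, variance)     # 5.0 22.0 4.4
print(round(std_dev, 2))         # 2.1

# Cross-check with the standard library.
print(math.isclose(statistics.variance(sample), variance))   # True
print(math.isclose(statistics.stdev(sample), std_dev))       # True
```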
Computing Formula for \(s^2\)
It is best to obtain \(s^2\) from statistical software or else use a calculator that allows you to enter data into memory and then view \(s^2\) with a single keystroke. If your calculator does not have this capability, there is an alternative formula for \(S_{xx}\) that avoids calculating the deviations.
An alternative expression for the numerator of \(s^2\) is
\[ S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \]
The formula involves both summing and then squaring, and squaring and then summing.
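A short numerical check of the shortcut follows, assuming the usual computing formula Sxx = (sum of squares) - (square of the sum)/n. It is applied to the six-value sample from the variance example and compared with the definition based on deviations.

```python
import math

sample = [2, 3, 5, 6, 7, 7]
n = len(sample)

sum_x = sum(sample)                       # summing first ...
sum_x_sq = sum(x * x for x in sample)     # ... versus squaring first, then summing

# Shortcut: no deviations from the mean are needed.
s_xx_shortcut = sum_x_sq - sum_x ** 2 / n

# Definition: sum of squared deviations from the mean.
x_bar = sum_x / n
s_xx_definition = sum((x - x_bar) ** 2 for x in sample)

print(sum_x, sum_x_sq)                                # 30 172
print(s_xx_shortcut, s_xx_definition)                 # 22.0 22.0
print(math.isclose(s_xx_shortcut / (n - 1), 4.4))     # True: same s^2 as before
```

On a computer the shortcut can lose precision when the values are large relative to their spread, which is one reason software typically works from the deviations.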
Key Words
population, sample, variable of interest, population parameter, sample statistics, measure of center, measure of spread, graphical displays
Inferential Statistics
Inferential Statistics is the process of drawing conclusions about a population, based on observations in the sample from that population.
Describing a Population
Because observations are made only on a sample, characteristics of biological populations are almost never exactly known. Typically, our knowledge of population characteristics comes from a sample. Just as each sample has a distribution, mean and standard deviation, so also we can envision a population distribution, mean, and standard deviation. Population characteristics are called parameters.
Suppose that a researcher is interested in estimating the proportion of American adults who have diabetes. Based on a sample of 1000 individuals they found that 72 are diabetic.
What proportion of adults in the sample are diabetic?
What is the (target) population?
Do you think that exactly 7.2% of American adults have diabetes?
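The sample proportion is the count divided by the sample size. The sketch below computes it and then, purely as an illustration, simulates a few more samples of the same size to show how the estimate varies from sample to sample. The "true" proportion of 0.07 used in the simulation is a made-up assumption, not a figure from the study.

```python
import random

sample_size = 1000
diabetic_in_sample = 72

p_hat = diabetic_in_sample / sample_size   # sample proportion
print(f"{p_hat:.3f}")                      # 0.072

# Hypothetical population proportion, assumed only to drive the simulation.
ASSUMED_TRUE_P = 0.07
rng = random.Random(1)                     # arbitrary seed for reproducibility

def simulate_p_hat(n, p, rng):
    """Draw n individuals, each diabetic with probability p, and return the sample proportion."""
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n

estimates = [simulate_p_hat(sample_size, ASSUMED_TRUE_P, rng) for _ in range(5)]
print(estimates)   # five estimates that differ from one another and from 0.07
```

The estimates cluster near the population value without matching it exactly, which is why the sample figure of 7.2% should be read as an estimate of the parameter.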
Measure              Sample Value (Estimate)   Population Value (Parameter)
Proportion           p̂ (p hat)                 p
Mean                 x̄ (x bar)                 µ (mu)
Standard deviation   s                         σ (sigma)