What is Statistics? - math.wsu.edu · STAT360 (Wang)


CH1: Overview and Descriptive Statistics

STAT360 (Wang)

What is Statistics?

Statistics covers the collection, analysis and interpretation of data:

• Design experiments to collect data.

• Extract information from data.

• Make decisions and predictions in the presence of uncertainty and variation.


• The population is the entire collection of units in which we are interested.

• Example: All babies born in the United States.


A sample is a collection of persons or things on which we measure one or more variables.

Example: 150 babies born in a certain hospital

The number of observations in a sample is called the sample size and is denoted by the letter n.

Example: The birth weights of 150 babies born in a certain hospital. (n=150)


[Diagram labels: Target Population, Sampled Population, Sample]

The target population is the population you are interested in.

The sampled population is the population you actually sample.

Warning: Inference applies only to the population from which you actually sampled.


Random Sampling

• A simple random sample is a sample where all the members of the population have an equal chance of selection and the selection of one member is independent of the selection of another.

• Random sampling allows us to make inference back to a population.

• We can use R, Excel or other software packages to generate random numbers or randomly sample a column of values.
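As one illustration of the software route, here is a minimal Python sketch of drawing a simple random sample. It uses Python's standard `random` module rather than R or Excel. The ID list, the helper name `simple_random_sample`, and the seed are all hypothetical, and the draw is made without replacement, which is the usual convention but an assumption here.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit has the same chance of selection."""
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return rng.sample(population, n)  # sampling without replacement

# Hypothetical sampling frame: ID numbers for 1,000 babies (illustrative only).
baby_ids = list(range(1, 1001))
sample_ids = simple_random_sample(baby_ids, 150, seed=360)
print(len(sample_ids))  # sample size n
```

Leaving `seed` as `None` gives a different sample on each run, which is what you would normally want outside of a classroom demonstration.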


• An experimental unit is an individual item in a population from which data are collected.

• A variable of interest is information of interest about each individual item in a population.


A variable is information of interest about each individual item in a population. Examples: Height, Weight, Age, Gender

Categorical/Qualitative Variables are those we can place into categories. Examples: Eye Color, Gender

Numerical/Quantitative Variables are those for which we can record a numerical value and then order respondents according to those values. Examples: Age, Time


Discrete Variables have values that can be obtained by counting. Examples: Number of Children, Age (in years)

Continuous Variables can take any value within a given interval. Examples: Height, Weight

[Diagram: VARIABLES split into CATEGORICAL and NUMERICAL; NUMERICAL splits into DISCRETE and CONTINUOUS]


• Population parameter: a numeric value describing a property of a population such as mean, median and variance.

• Sample statistics (also referred to simply as "statistics"): numeric values describing, or summarizing, a data set. Sample statistics are also referred to as estimators because one of the purposes of statistics is to estimate population parameters.


Descriptive Statistics is the group of techniques for summarizing and describing important features of the data.

Inferential Statistics is the group of techniques for generalizing from a sample to a population.


Example

The Denver Post reported on a study examining the effect of diet on the life span of fruit flies (Low-cal diet extends fruit-fly life at any age, 9/19/03).

The study considered 7,492 fruit flies. Half were randomly assigned to a low-cal diet and the other half received a regular diet. The number of days lived was recorded for each fruit fly.


What is the sample and sample size?

The explanatory variable is whether or not a fly received the low-cal diet. Is this a categorical or numerical variable?

The outcome variable is the number of days lived. Is this a categorical or numerical variable?


Descriptive Statistics: Measures of Center

The mean is simply the average. To find the mean we add up (sum) all the values and divide by the number of values:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum x_i}{n}$$

For the radish data the sample mean is 22.14.
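The calculation can be sketched in a few lines of Python, using the 14 radish lengths (in mm) from the Radish Growth example. The function name `sample_mean` is just an illustrative choice.

```python
def sample_mean(values):
    """Add up all the values and divide by the number of values."""
    return sum(values) / len(values)

# Radish lengths in mm (grown in total darkness for three days).
radish = [15, 20, 11, 30, 33, 20, 29, 35, 8, 10, 22, 37, 15, 25]

x_bar = sample_mean(radish)
print(round(x_bar, 2))  # 22.14
```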


A physical interpretation of x̄ demonstrates how it measures the location (center) of a sample. Think of drawing and scaling a horizontal measurement axis, and then represent each sample observation by a 1-lb weight placed at the corresponding point on the axis.

The only point at which a fulcrum can be placed to balance the system of weights is the point corresponding to the value of x̄ (see Figure 1.14).

Figure 1.14 The mean as the balance point for a system of weights

The median is the value that most nearly lies in the middle of the sample.

From a list of ordered observations, the median is the middle value (if n is odd) or the average of the two middle values (if n is even).

Sorted Radish Data: 8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37

For the radish data, n = 14 is even, so the median is the average of the two middle values: (20 + 22)/2 = 21.
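The odd/even rule above can be sketched in Python as follows. The helper name `sample_median` is an illustrative choice, not something from the slides.

```python
def sample_median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

radish = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]
print(sample_median(radish))  # 21.0
```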


Example: A prospective employee at a laboratory is told that the mean hourly wage is $16/hour. The actual wages of employees at the lab are: $10, $10, $10, $10, $40. Since only the manager earns more than $10/hour, the median is a better measure of center in this case!

The mean is more affected by outliers in the sample. Another way to say this is that the median is more resistant.

If the total is of interest, then the mean is probably the appropriate summary value. A rancher would be interested in mean cattle weight, because total weight translates to profit.

In many other cases, the median does a better job of representing the center.
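The laboratory wage example can be checked with Python's standard `statistics` module. This is only a small sketch: it shows how a single large value (the manager's $40) pulls the mean up while leaving the median unchanged.

```python
from statistics import mean, median

wages = [10, 10, 10, 10, 40]   # hourly wages from the laboratory example
without_manager = wages[:-1]   # drop the $40 wage to see its influence

mean_all, median_all = mean(wages), median(wages)
mean_rest, median_rest = mean(without_manager), median(without_manager)

print(mean_all, median_all)    # mean is 16, median is 10
print(mean_rest, median_rest)  # both are 10 once the outlier is removed
```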


• The trimmed mean is a compromise between the mean and the median.

• A 10% trimmed mean is calculated by eliminating the smallest 10% and largest 10% of the sample and then averaging over what is left.
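A minimal Python sketch of the trimmed mean follows. The slides do not say what to do when 10% of n is not a whole number, so this version simply rounds the number of trimmed observations down. That is an assumption (some textbooks interpolate instead), and the function name `trimmed_mean` is illustrative.

```python
def trimmed_mean(values, proportion):
    """Drop the smallest and largest `proportion` of the sample, then average."""
    ordered = sorted(values)
    n = len(ordered)
    # Number trimmed from EACH end, rounded down (assumption; the tiny
    # tolerance guards against floating-point products like 2.9999999).
    k = int(proportion * n + 1e-9)
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("proportion too large: nothing left to average")
    return sum(kept) / len(kept)

radish = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]
# With n = 14, 10% of n is 1.4, so one value is dropped from each end here.
print(round(trimmed_mean(radish, 0.10), 2))  # 22.08
```

With a trimming proportion of 0 this reduces to the ordinary mean, and as the proportion approaches 50% it approaches the median, which is the sense in which it is a compromise between the two.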


Graphical Displays

• Pie Chart (Categorical)

• Bar Graph (Categorical)

• Stem and Leaf Plot (Numerical)

• Histogram (Numerical)

• Box Plot (Numerical)


Example: Student Majors

I recorded information about the majors of students in one class.

Major       Frequency   Relative Frequency
C&B Eng          6            0.082
Eng Sc           3            0.041
Math            13            0.178
Mech Eng        37            0.507
Other           11            0.151
Physics          3            0.041
Total           73            1.000
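The relative frequencies in the majors table are each frequency divided by the total of 73. A short Python sketch of that arithmetic, with the counts copied from the table and illustrative variable names:

```python
# Frequencies from the majors example.
counts = {
    "C&B Eng": 6,
    "Eng Sc": 3,
    "Math": 13,
    "Mech Eng": 37,
    "Other": 11,
    "Physics": 3,
}

total = sum(counts.values())
relative = {major: count / total for major, count in counts.items()}

for major, count in counts.items():
    print(f"{major:<10}{count:>4}{relative[major]:>8.3f}")
print(f"{'Total':<10}{total:>4}{sum(relative.values()):>8.3f}")
```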


Pie Charts

Pie charts are useful when only one categorical variable is observed.

[Pie chart titled "STAT315: Majors" with slices for CBE, Engineering Science, Mathematics, Mechanical Engineering, Other, Physics]


Bar Charts

Bar charts also show percentages or frequencies in various categories, but they can be used to represent multiple categorical variables simultaneously.

[Bar chart: Frequency by Major, with bars for CBE, Engineering Science, Mathematics, Mechanical Engineering, Other, Physics]


Stem and Leaf Plots

• Stem and leaf plots are used to represent a single numerical variable.

• They are typically used when the sample size is small.

• Procedure:
  1. Sort the data.
  2. Create the stems.
  3. Add the leaves.


Example: Radish Growth

This data represents the length (in mm) of radishes grown in total darkness for three days.

15 20 11 30 33
20 29 35  8 10
22 37 15 25


1. Order the data: 8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37

2. Choose the stems: Use the 10’s digit.

3. Add the leaves.

Stem | Leaves
   0 | 8
   1 | 0 1 5 5
   2 | 0 0 2 5 9
   3 | 0 3 5 7
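The three-step procedure (sort, choose stems, add leaves) can be sketched in Python. This is a minimal version that assumes non-negative whole-number data with the tens digit as the stem, as in the radish example. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot (stem = tens, leaf = ones)."""
    rows = defaultdict(list)
    for v in sorted(values):          # step 1: sort the data
        stem, leaf = divmod(v, 10)    # step 2: the stem is the 10's digit
        rows[stem].append(leaf)       # step 3: add the leaf
    lines = []
    for stem in range(min(rows), max(rows) + 1):  # keep empty stems, too
        leaves = " ".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

radish = [15, 20, 11, 30, 33, 20, 29, 35, 8, 10, 22, 37, 15, 25]
for line in stem_and_leaf(radish):
    print(line)
```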


Histograms

1. Divide the data up into a Frequency/Relative Frequency table:
   - Divide the data into groups of equal width.
   - List the frequency and relative frequency for each group.
     Frequency = total number in that group.
     Relative Frequency = proportion of the total in that group.

2. Draw a bar corresponding to each row in the table, with height corresponding to frequency or relative frequency.


Sorted Radish Data:

8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37

Frequency and Relative Frequency Table:

Length     Frequency   Relative Frequency
< 10           1            0.071
10 - 19        4            0.286
20 - 29        5            0.357
30 - 39        4            0.286

Frequency Histogram:

[Histogram of Radish$Length: frequency (0 to 5) on the vertical axis, length (0 to 40) on the horizontal axis]
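The frequency and relative frequency table behind the radish histogram can be built with a short Python sketch. It assumes whole-number data and equal-width groups that include their lower endpoint (0-9, 10-19, and so on). The function name `frequency_table` and its parameters are illustrative.

```python
def frequency_table(values, width, start=0):
    """Return (low, high, frequency, relative frequency) for equal-width groups."""
    n = len(values)
    largest = max(values)
    table = []
    low = start
    while low <= largest:
        high = low + width
        count = sum(1 for v in values if low <= v < high)
        # high - 1 is the largest whole number in the group (integer data assumed)
        table.append((low, high - 1, count, count / n))
        low = high
    return table

radish = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]
for low, high, freq, rel in frequency_table(radish, width=10):
    print(f"{low:>2} - {high:<3}{freq:>3}{rel:>8.3f}")
```

The bar heights of the histogram are then just the frequency (or relative frequency) column of this table.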


Frequency Distributions: Shapes and Examples

When investigators speak about the "shape" of the data, they are referring to the shape of the histogram resulting from the data.

Symmetric Data Sets: A data set for which the histogram is (approximately) symmetric.

[Figure: Symmetric Histogram]


A data set is referred to as unimodal if there is a single prominent peak in the histogram. An example is the Symmetric histogram (previous slide).

A data set is referred to as bimodal if there are two prominent peaks in the histogram.

[Figure: Bimodal Histogram]


A Skewed Data Set is one that is basically unimodal but is substantially asymmetric: one tail of the histogram is much longer than the other.

[Figures: Skewed Left Histogram; Skewed Right Histogram]

NOTE: The direction of the skew is in the direction of the long tail.


Figure 1.11 shows “smoothed” histograms, obtained by superimposing a smooth curve on the rectangles, that illustrate the various possibilities.

(a) symmetric unimodal, (b) bimodal, (c) positively skewed (skewed right), (d) negatively skewed (skewed left)


The Five Number Summary and Boxplots

The Five Number Summary:
- minimum (smallest value)
- Q1 (first quartile)
- median
- Q3 (third quartile)
- maximum (largest value)

Note that the terms (quartiles, interquartile range) and notation (Q1, Q3) used here are different from the book’s terms and notation (fourths and fs).


Upper and Lower Quartiles (Q1 and Q3)

• Divide the data into two equal high and low groups at the median.

• The median is included in both the upper and lower groups when n is odd.

• Find the median of the low group. This is called the lower quartile or Q1.

• The median of the high group is the upper (third) quartile or Q3.


Example: Radish Data

SORT THE DATA: 8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37

Min = 8

Q1 = 15

Median = 21

Q3 = 30

Max = 37

NOTE: Some authors and computer packages define the quartiles differently. If n is large there is little practical difference between the definitions. When n is small the difference is noticeable.
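The quartile definition used in these slides (split at the median, and include the median in both halves when n is odd) can be sketched in Python as below. Because packages differ in how they define quartiles, this is written out by hand rather than relying on a library routine. The function names are illustrative.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-inclusive halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # n odd: the median belongs to both the low and the high group
        low, high = ordered[:half + 1], ordered[half:]
    else:
        low, high = ordered[:half], ordered[half:]
    return (ordered[0], median_of_sorted(low), median_of_sorted(ordered),
            median_of_sorted(high), ordered[-1])

radish = [8, 10, 11, 15, 15, 20, 20, 22, 25, 29, 30, 33, 35, 37]
print(five_number_summary(radish))  # (8, 15, 21.0, 30, 37)
```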


Boxplots

A boxplot features the values from the five number summary (plus some additional information).

[Boxplot of radish Length, annotated with Min = 8, Q1 = 15, Median = 21, Q3 = 30, Max = 37]


Creating a Boxplot

1. Draw a box with ends at the lower and upper quartiles.
2. Draw a line in the box at the median.
3. Compute the width of the box; this is the interquartile range: IQR = Q3 - Q1
4. Draw the whiskers at each end with length equal to 1.5 times the interquartile range (IQR); if the minimum or maximum occurs before the full length is used, stop there.
5. Use an asterisk to indicate any additional data points beyond the range covered by the box and whiskers.


Example: 11 male college students were asked how many hours per week they exercise. The (sorted) responses were:

0, 1, 2, 2, 3, 4, 4.5, 5, 6, 8, 15

The five number summary is: Min= 0, Q1=2, Median=4, Q3=5.5, Max=15


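[Number line: Hours of Exercise, 0 to 15, marked with min = 0, Q1 = 2, M = 4, Q3 = 5.5, max = 15]

1. and 2. Draw a box with lines at Q1, median and Q3.

3. Calculate the IQR = Q3 - Q1 = 5.5 - 2 = 3.5

4. Whiskers: 1.5 x IQR = 5.25

Lower whisker to Q1 - 1.5*IQR or min = 0 (2 - 5.25 = -3.25 lies below the minimum, so stop at the minimum)

Upper whisker to Q3 + 1.5*IQR or max = 10.75 (5.5 + 5.25 = 10.75 is reached before the maximum)

5. Add asterisks for outliers: 15 lies beyond the upper whisker.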
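The whisker and outlier steps for the exercise data can be sketched in Python. This follows the rule stated in these slides (each whisker has length 1.5 x IQR unless the minimum or maximum is reached first). Many software packages instead end the whisker at the most extreme observation inside that limit, so their plots may look slightly different. The function name is illustrative.

```python
def whiskers_and_outliers(values, q1, q3):
    """Return (IQR, lower whisker end, upper whisker end, outliers)."""
    iqr = q3 - q1
    reach = 1.5 * iqr
    # Stop at the minimum or maximum if it occurs before the full length is used.
    lower_end = max(min(values), q1 - reach)
    upper_end = min(max(values), q3 + reach)
    outliers = [v for v in values if v < lower_end or v > upper_end]
    return iqr, lower_end, upper_end, outliers

hours = [0, 1, 2, 2, 3, 4, 4.5, 5, 6, 8, 15]
print(whiskers_and_outliers(hours, q1=2, q3=5.5))  # (3.5, 0, 10.75, [15])
```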


Here we use side-by-side boxplots to compare the distribution of length for radishes grown in three days of darkness versus those grown in three days of light.


Measures of Variability

The mean and median for the distributions shown here are nearly the same. However, the spread is greater for Histogram #2.

How do we summarize the variability?


Descriptive Statistics: Measures of Spread

Two simple measures of spread are:

• Range = Max – Min

• Interquartile Range (IQR) = Q3 – Q1


Variance and Standard Deviation

• Variance and standard deviation are both measures of spread.

• The sample variance is computed as:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1}$$

• The sample standard deviation is computed as:

$$s = \sqrt{s^2}$$


• The standard deviation is like the “average” distance to the mean.

• As the data spreads out, the variance and standard deviation increase.

• As the data becomes more concentrated, the variance and standard deviation decrease.


We will compute the variance and standard deviation for a very small sample (n=6):

2, 3, 5, 6, 7, 7

The mean of these values is x̄ = 5.

x     x − x̄    (x − x̄)²
2      -3         9
3      -2         4
5       0         0
6       1         1
7       2         4
7       2         4

$$\sum (x_i - \bar{x})^2 = 22$$

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{22}{5} = 4.4$$

$$s = \sqrt{s^2} = \sqrt{4.4} = 2.10$$

So, the sample variance is 4.4 and the sample standard deviation is 2.10.
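The same hand calculation can be reproduced with a short Python sketch that follows the deviations-from-the-mean approach step by step. The function name `sample_variance` is illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    s_xx = sum((x - x_bar) ** 2 for x in values)
    return s_xx / (n - 1)

data = [2, 3, 5, 6, 7, 7]
s2 = sample_variance(data)   # sample variance
s = math.sqrt(s2)            # sample standard deviation

print(round(s2, 2), round(s, 2))  # 4.4 2.1
```

Python's standard `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they can serve as a cross-check.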


Computing Formula for s²

It is best to obtain s² from statistical software or else use a calculator that allows you to enter data into memory and then view s² with a single keystroke. If your calculator does not have this capability, there is an alternative formula for Sxx that avoids calculating the deviations.

The formula involves both summing and then squaring, and squaring and then summing.

An alternative expression for the numerator of s² is

$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$
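A minimal Python sketch, assuming the alternative expression is the usual identity Sxx = Σxᵢ² − (Σxᵢ)²/n, compares it with the definition based on deviations. The small data set 2, 3, 5, 6, 7, 7 is the one used in the variance example, and the function names are illustrative.

```python
def s_xx_definition(values):
    """Sum of squared deviations from the mean."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values)

def s_xx_shortcut(values):
    """Square then sum, minus (sum then square) divided by n."""
    n = len(values)
    return sum(x * x for x in values) - sum(values) ** 2 / n

data = [2, 3, 5, 6, 7, 7]
print(s_xx_definition(data), s_xx_shortcut(data))  # 22.0 22.0
print(s_xx_shortcut(data) / (len(data) - 1))       # 4.4
```

The shortcut is convenient by hand, but in floating-point arithmetic it can lose accuracy when the values are large relative to their spread, which is one reason to prefer statistical software.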


Key Words

population, sample, variable of interest, population parameter, sample statistics, measure of center, measure of spread, graphical displays


Inferential Statistics

Inferential Statistics is the process of drawing conclusions about a population, based on observations in the sample from that population.


Describing a Population

Because observations are made only on a sample, characteristics of biological populations are almost never exactly known. Typically, our knowledge of population characteristics comes from a sample. Just as each sample has a distribution, mean and standard deviation, so also we can envision a population distribution, mean, and standard deviation. Population characteristics are called parameters.


Suppose that a researcher is interested in estimating the proportion of American adults who have diabetes. Based on a sample of 1000 individuals they found that 72 are diabetic.

What proportion of adults in the sample are diabetic?

What is the (target) population?

Do you think that exactly 7.2% of American adults have diabetes?
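To make the last question concrete, here is a small Python sketch. It first computes the sample proportion from the example, then simulates a few repeated samples of 1000 from a population whose true proportion is set to 0.07. That 0.07 is a purely hypothetical value chosen for illustration, not a figure from the study, and the seed is arbitrary.

```python
import random

# Sample proportion from the example: 72 diabetic out of 1000 sampled.
p_hat = 72 / 1000
print(p_hat)  # 0.072

# Hypothetical population proportion (an assumption, for illustration only).
assumed_p = 0.07
rng = random.Random(1)

estimates = []
for _ in range(5):
    # Each simulated individual is diabetic with probability assumed_p.
    count = sum(1 for _ in range(1000) if rng.random() < assumed_p)
    estimates.append(count / 1000)

print(estimates)  # the sample proportions vary from sample to sample
```

Because each sample gives a somewhat different estimate, a sample value such as 7.2% should be read as an estimate of the population proportion rather than its exact value.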


Measure              Sample Value (Estimate)   Population Value (Parameter)
Proportion           p̂ (p hat)                 p
Mean                 x̄ (x bar)                 µ (mu)
Standard deviation   s                         σ (sigma)
