last lecture summary


Upload: merton

Post on 04-Jan-2016


Page 1: Last lecture summary

Last lecture summary
• Five-number summary, percentiles, mean
• Box plot, modified box plot
• Robust statistic – mean, median, trimmed mean
• outlier
• Measures of variability
• range, IQR

Page 2: Last lecture summary

MEASURES OF VARIABILITY

Page 3: Last lecture summary

Problem with IQR

[Figure: normal, bimodal, and uniform distributions]

Page 4: Last lecture summary

Options for measuring variability

1. Find the average distance between all pairs of data values.

2. Find the average distance between each data value and either the max or the min.

3. Find the average distance between each data value and the mean.

Page 5: Last lecture summary

Preventing cancellation
• How can we prevent the negative and positive deviations from cancelling each other out?

1. Take the absolute value of each deviation.

2. Square each deviation.
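A minimal sketch of why the cancellation matters, using the ten sample values from the lecture's average-absolute-deviation example. The variable names are illustrative only.

```python
# Ten sample values from the lecture's worked example.
data = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# Signed deviations from the mean always sum to zero,
# so their plain average says nothing about spread.
signed_sum = sum(deviations)

# Two ways to stop the cancellation:
absolute = [abs(d) for d in deviations]   # 1. absolute value of each deviation
squared = [d ** 2 for d in deviations]    # 2. square of each deviation

print(signed_sum, sum(absolute), sum(squared))
```

The signed deviations add up to 0, while the absolute and squared deviations add up to 46 and 312.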

Page 6: Last lecture summary

Average absolute deviation

Sample   Deviation from mean   Absolute deviation
10        4                     4
5        -1                     1
3        -3                     3
2        -4                     4
19       13                    13
1        -5                     5
7         1                     1
11        5                     5
1        -5                     5
1        -5                     5

avg. absolute deviation = 4.6

Page 7: Last lecture summary

Average absolute deviation

What formula describes what you just did?
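One possible way to write that calculation down, assuming the notation x_i for the data values, x̄ for their mean, and n for the number of values:

```latex
\text{average absolute deviation} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
```

For the ten values in the table, the absolute deviations sum to 46, so the result is 46 / 10 = 4.6.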

Page 8: Last lecture summary

Squared deviations

Sample   Deviation from mean   Squared deviation
10        4
5        -1
3        -3
2        -4
19       13
1        -5
7         1
11        5
1        -5
1        -5

Page 9: Last lecture summary

Squared deviations

Sample   Deviation from mean   Squared deviation
10        4                     16
5        -1                      1
3        -3                      9
2        -4                     16
19       13                    169
1        -5                     25
7         1                      1
11        5                     25
1        -5                     25
1        -5                     25

avg. squared deviation = 31.2

SS, sum of squares (Czech: čtverce odchylek)
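The table can be reproduced in a few lines. This is a sketch only: `ss` and `avg_squared_deviation` are names chosen here, and `statistics.pvariance` serves as a cross-check because it also divides by n.

```python
import math
import statistics

data = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

mean = sum(data) / len(data)
ss = sum((x - mean) ** 2 for x in data)      # SS, the sum of squared deviations
avg_squared_deviation = ss / len(data)       # divide by n

# Cross-check against the standard library's divide-by-n variance.
assert math.isclose(avg_squared_deviation, statistics.pvariance(data))

print(ss, avg_squared_deviation)
```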

Page 10: Last lecture summary

Variance

Average squared deviation has a special name – variance (Czech: rozptyl).

Page 11: Last lecture summary

Standard deviation
• Czech: směrodatná odchylka
• Which symbol would you use for a variance?
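A sketch of the notation, assuming the common convention that σ denotes the standard deviation and x̄ the mean of the n values. Under that convention the variance is written σ², and the standard deviation is its square root:

```latex
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\sigma = \sqrt{\sigma^2}
```

For the ten-value example, σ² = 31.2 and σ = √31.2 ≈ 5.59.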

Page 12: Last lecture summary

Standard deviation
• What is so great about the standard deviation? Why don’t we just find the average absolute deviation?

More on absolute vs. standard deviation: http://www.leeds.ac.uk/educol/documents/00003759.htm

Empirical rule

68% - 1 s.d.
95% - 2 s.d.
99.7% - 3 s.d.
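The three percentages are the proportions of a normal distribution that lie within 1, 2 and 3 standard deviations of the mean. A small sketch using the standard library's `statistics.NormalDist`; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k s.d. of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} s.d.: {share_within(k):.1%}")
```

The printed shares are roughly 68%, 95% and 99.7%. Real data that are only approximately bell-shaped will deviate from these figures somewhat.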

Page 13: Last lecture summary

Empirical rule

The interval within 1 s.d. of the mean covers 273 data values, 66.8%.

The interval within 2 s.d. covers 380 data values, 95%.

The interval within 3 s.d. covers 397 data values, 99.3%.

Page 14: Last lecture summary

Empirical rule

197 countries

65% within 1 s.d.

94.7% within 2 s.d.

100% within 3 s.d.

Page 15: Last lecture summary

Statistical inference
• The goal of statistical work: make rational conclusions or decisions based on the incomplete information we have in our data.
• This process is known as statistical inference.
• In inferential statistics we want to be able to answer the question: “If I see something in my data, say a difference between two groups or a relationship between two variables, could this be simply due to chance? Or is it a real difference or relationship?”

Page 16: Last lecture summary

Statistical inference
• If we get results that we think are not just due to chance, we'd like to know what broader conclusions we can make. Can we generalize them to a larger group or even perhaps the whole world?
• And when we see a relationship between two variables, we'd like to know if one variable causes the other to change.
• The methods we use to do so and the correctness of the conclusions that we can make all depend on how the data were collected.

Page 17: Last lecture summary

Statistical inference
• Fundamental feature of data: variability.
• How can we picture this variation and how can we quantify it?
• Population – the group we are interested in making conclusions about.
• Census – a collection of data on the entire population.
• Sample – if we can’t conduct a census, we collect data from a sample of the population. Goal: make conclusions about that population.

Page 18: Last lecture summary

Statistical inference
• A statistic is a value calculated from our observed data (sample).
• A parameter is a value that describes the population.
• We want to be able to generalize what we observe in our data to our population. In order to do this, the sample needs to be representative.
• How to select a representative sample? Use randomization.

Page 19: Last lecture summary
Page 20: Last lecture summary

Population - parameter: mean μ, standard deviation σ

Sample - statistic: mean x̄, standard deviation s
(Czech: výběr - statistika: výběrový průměr, výběrová směrodatná odchylka)

population (census) vs. sample

parameter (population) vs. statistic (sample)

Page 21: Last lecture summary

Random sampling
• Simple Random Sampling (SRS) – each possible sample from the population is equally likely to be selected.
• Stratified Sampling – simple random sample from subgroups of the population
  • subgroups: gender, age groups, …
• Cluster sampling – divide the population into non-overlapping groups (clusters), sample is a randomly chosen cluster
  • example: population are all students in an area, randomly select schools and create a sample from students of the given school
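The three schemes can be sketched with Python's `random` module. Everything below is hypothetical: the 12-student population, the `school` and `gender` fields and the function names are invented for illustration only.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 students, each with a school and a gender.
population = [
    {"id": i, "school": "ABC"[i % 3], "gender": "FM"[i % 2]}
    for i in range(12)
]

def simple_random_sample(pop, n):
    # Every possible sample of size n is equally likely.
    return rng.sample(pop, n)

def stratified_sample(pop, key, n_per_stratum):
    # Split the population into subgroups (strata), then take an SRS in each.
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(pop, key, n_clusters):
    # Randomly choose whole clusters and keep every unit inside them.
    clusters = sorted({unit[key] for unit in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in pop if unit[key] in chosen]

srs = simple_random_sample(population, 4)
by_gender = stratified_sample(population, "gender", 2)
one_school = cluster_sample(population, "school", 1)
```

Stratifying by gender guarantees two students of each gender, whereas the cluster sample contains every student of one randomly chosen school.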

Page 22: Last lecture summary

Bias
• If a sample is not representative, it can introduce bias into our results.
• bias (Czech: zkreslení, odchylka)
• A sample is biased if it differs from the population in a systematic way.
• The Literary Digest poll, 1936 U.S. presidential election
  • surveyed 10 million people: magazine subscribers and owners of cars or telephones
  • 2.3 million responded, predicting (3:2) a win for the Republican candidate
  • the Democratic candidate won
• What went wrong?
  • only wealthy people were surveyed (selection bias)
  • the survey relied on voluntary response (nonresponse bias): those who answer tend to be angry people or people who want a change

Page 23: Last lecture summary

Bessel’s correction

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Page 24: Last lecture summary

Sample vs. population SD
• We use the sample standard deviation to approximate the population parameter.
• But don’t confuse it with the actual standard deviation of a small dataset.
• For example, let’s have this dataset: 5 2 1 0 7. Do you divide by n or by n − 1?
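Both choices can be computed with the standard library's `statistics` module, which has separate functions for the two divisors. A short sketch on the dataset 5 2 1 0 7:

```python
import statistics

data = [5, 2, 1, 0, 7]

# Treating the five numbers as the whole population: divide by n.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

# Treating them as a sample from a larger population: divide by n - 1.
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

print(population_variance, sample_variance)
```

The sum of squared deviations is 34, so the two variances are 34 / 5 = 6.8 and 34 / 4 = 8.5. Which one is appropriate depends on whether the five numbers are the whole dataset of interest or a sample standing in for something larger.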

Page 25: Last lecture summary

• Suppose you have a bag with 3 cards in it. The cards are numbered 0, 2 and 4.

• What is the population mean? And the population variance?

• An important property of a sample statistic that estimates a population parameter is that if you evaluate the sample statistic for every possible sample and average them all, the average of the sample statistic should equal the population parameter.

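We want: the average of the sample statistic over all possible samples = the population parameter.
• A statistic with this property is called unbiased.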

Page 26: Last lecture summary

SRS
• sampling with replacement
  • Generates independent samples.
  • Two sample values are independent if what we get on the first one doesn't affect what we get on the second.
• sampling without replacement
  • Deliberately avoid choosing any member of the population more than once.
  • This type of sampling is not independent; however, it is more common.
  • The error is small as long as
    1. the sample is large
    2. the sample size is no more than 10% of the population size
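In Python's `random` module the two schemes correspond to two different functions. The population of 100 labelled units below is hypothetical, and the sample size of 10 was picked to sit at the 10% limit.

```python
import random

rng = random.Random(1)          # fixed seed so the sketch is repeatable
population = list(range(100))   # hypothetical population of 100 labelled units

# With replacement: each draw ignores earlier draws, so a unit can repeat.
with_replacement = rng.choices(population, k=10)

# Without replacement: every unit can appear at most once.
without_replacement = rng.sample(population, k=10)
```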

Page 27: Last lecture summary

Bessel’s game
• Now list all possible samples of 2 cards.
• Calculate sample averages.
• Now, half of you calculate the sample variance using /n, and half of you using /(n-1).
• And then average all sample variances.

Table to fill in: Sample | Sample average

Population of all cards in a bag: 0, 2, 4
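The game can be played exhaustively in code. This sketch assumes that samples are drawn with replacement and that order matters, which gives 9 possible samples of 2 cards. The function `variance` and its `divisor` argument are names chosen here, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from itertools import product

cards = [0, 2, 4]

def variance(values, divisor):
    # Sum of squared deviations from the mean, divided by the given divisor.
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor

population_variance = variance(cards, len(cards))

# Assumption: ordered samples of 2 cards drawn with replacement (9 samples).
samples = list(product(cards, repeat=2))

avg_var_n = sum(variance(s, 2) for s in samples) / len(samples)          # /n
avg_var_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)  # /(n-1)

print(population_variance, avg_var_n, avg_var_n_minus_1)
```

Under these assumptions the population variance is 8/3. Dividing by n averages out to 4/3, which is too small, while dividing by n − 1 averages out to exactly 8/3.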

Page 28: Last lecture summary

Measuring spread – summary
• median = $112 000
• mean = $518 000
• trimmed median = $112 000
• trimmed mean = $128 000

Data: 33 750, 33 750, 33 750, 33 750, 44 000, 44 000, 44 000, 44 000, 45 566, 65 000, 95 000, 103 500, 112 495, 138 188, 141 666, 181 500, 185 000, 190 000, 194 375, 195 000, 205 000, 292 500, 301 999, 4 600 000, 5 600 000

Page 29: Last lecture summary

Measuring spread – summary

         original data   trimmed data   robust
median   $112 000        $112 000       yes
mean     $518 000        $128 000       no
range    $5 566 000      $268 000       no
IQR      $150 000        $146 000       yes
s.d.     $1 389 000      $84 000        no
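The summary can be recomputed from the 25 data values. This is a sketch under two assumptions: "trimmed" means dropping the 2 smallest and 2 largest values, a choice that reproduces the trimmed mean and range above to the nearest thousand, and the s.d. divides by n − 1. The IQR depends on the quartile convention, so the value from `statistics.quantiles` need not match the table exactly.

```python
import statistics

data = [33750] * 4 + [44000] * 4 + [
    45566, 65000, 95000, 103500, 112495, 138188, 141666, 181500,
    185000, 190000, 194375, 195000, 205000, 292500, 301999,
    4600000, 5600000,
]

def summarize(values):
    # Quartiles via the library default ("exclusive" method); other
    # conventions give slightly different IQRs.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "IQR": q3 - q1,
        "s.d.": statistics.stdev(values),  # divides by n - 1
    }

original = summarize(data)
# Assumption: trimming removes the 2 smallest and the 2 largest values.
trimmed = summarize(sorted(data)[2:-2])

for name in original:
    print(f"{name:>6}: {original[name]:>12,.0f} {trimmed[name]:>12,.0f}")
```

The median and IQR barely move when the extreme values are removed, while the mean, range and s.d. change drastically.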
