
Page 1: Introduction to Statistics - Part 2

Quantitative Data Analysis: Statistics – Part 2

Page 2: Introduction to Statistics - Part 2

Overview

Part 1:
- Picturing the Data
- Pitfalls of Surveys
- Averages
- Variance and Standard Deviation

Part 2:
- The Normal Distribution
- Z-Tests
- Confidence Intervals
- T-Tests

Page 3: Introduction to Statistics - Part 2

The Normal Distribution

Page 4: Introduction to Statistics - Part 2

The Normal Distribution

Abraham de Moivre, the 18th-century statistician and consultant to gamblers, was often called upon to make lengthy computations about coin flips. De Moivre noted that as the number of events (coin flips) increased, the shape of the binomial distribution approached a very smooth curve.

In 1809 Carl Gauss developed the formula for the normal distribution and showed that many natural phenomena are at least approximately normally distributed.

Page 5: Introduction to Statistics - Part 2

Abraham de Moivre

Born 26 May 1667; died 27 November 1754; born in Champagne, France. He wrote a textbook on probability theory, "The Doctrine of Chances: a method of calculating the probabilities of events in play". This book came out in four editions: 1711 in Latin, and 1718, 1738 and 1756 in English.

In the later editions of his book, de Moivre gives the first statement of the formula for the normal distribution curve.

Page 6: Introduction to Statistics - Part 2

Carl Friedrich Gauss

Born 30 April 1777; died 23 February 1855; born in Lower Saxony, Germany. In 1809 Gauss published the monograph "Theoria motus corporum coelestium in sectionibus conicis solem ambientium", in which, among other things, he introduces and describes several important statistical concepts, such as the method of least squares, the method of maximum likelihood, and the normal distribution.

Page 9: Introduction to Statistics - Part 2

The Normal Distribution

Page 10: Introduction to Statistics - Part 2

The Normal Distribution

- Age of students in a class
- Body temperature
- Pulse rate
- Shoe size
- IQ score
- Diameter of trees
- Height?

Page 11: Introduction to Statistics - Part 2

The Normal Distribution

Page 12: Introduction to Statistics - Part 2

The Normal Distribution

Page 14: Introduction to Statistics - Part 2

Density Curves: Properties

Page 15: Introduction to Statistics - Part 2

The Normal Distribution

The graph has a single peak at the center; this peak occurs at the mean

The graph is symmetrical about the mean

The graph never touches the horizontal axis

The area under the graph is equal to 1

Page 16: Introduction to Statistics - Part 2

Characterization

A normal distribution is bell-shaped and symmetric.

The distribution is determined by the mean μ and the standard deviation σ.

The mean μ controls the center and σ controls the spread.
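For reference, the density curve that these two parameters determine is the standard bell-curve formula (a textbook result, not transcribed from the slides):

    f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}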

Page 17: Introduction to Statistics - Part 2

Same Mean, Different Standard Deviation


Page 18: Introduction to Statistics - Part 2

Different Mean, Different Standard Deviation


Page 19: Introduction to Statistics - Part 2

Different Mean, Same Standard Deviation


Page 29: Introduction to Statistics - Part 2

The Normal Distribution

If a variable is normally distributed, then:

- within one standard deviation of the mean there will be approximately 68% of the data
- within two standard deviations of the mean there will be approximately 95% of the data
- within three standard deviations of the mean there will be approximately 99.7% of the data
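These percentages follow directly from the standard normal CDF; a quick check in Python (a minimal sketch, assuming SciPy is available):

    from scipy.stats import norm

    # P(|Z| <= k) for k = 1, 2, 3 standard deviations from the mean
    for k in (1, 2, 3):
        p = norm.cdf(k) - norm.cdf(-k)
        print(f"within {k} sd: {p:.1%}")   # ~68.3%, ~95.4%, ~99.7%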

Page 30: Introduction to Statistics - Part 2

The Normal Distribution

Page 31: Introduction to Statistics - Part 2

Why?

One reason the normal distribution is important is that many psychological and organisational variables are distributed approximately normally. Measures of reading ability, introversion, job satisfaction, and memory are among the many psychological variables that are approximately normally distributed. Although the distributions are only approximately normal, they are usually quite close.

Page 32: Introduction to Statistics - Part 2

Why?

A second reason the normal distribution is so important is that it is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions. Almost all statistical tests discussed in this text assume normal distributions. Fortunately, these tests work very well even if the distribution is only approximately normally distributed. Some tests work well even with very wide deviations from normality.

Page 33: Introduction to Statistics - Part 2

So what?

Imagine we undertook an experiment where we measured staff productivity before and after introducing a computer system to help staff record solutions to common work issues.

Average productivity before = 6.4 Average productivity after = 9.2

Page 34: Introduction to Statistics - Part 2

So what?

Before = 6.4 After = 9.2

Page 35: Introduction to Statistics - Part 2

So what?

Before = 6.4 After = 9.2

(Pages 36–41 repeat this slide; the accompanying figures are not transcribed.)

Page 42: Introduction to Statistics - Part 2

One Tail / Two Tail

One-Tailed:  H0: μ1 ≥ μ2    HA: μ1 < μ2

Two-Tailed:  H0: μ1 = μ2    HA: μ1 ≠ μ2

Page 43: Introduction to Statistics - Part 2

STANDARD NORMAL DISTRIBUTION

A Normal Distribution is defined as N(mean, (std dev)²), i.e. N(μ, σ²). The Standard Normal Distribution is defined as N(0, 1²) = N(0, 1).

Page 44: Introduction to Statistics - Part 2

STANDARD NORMAL DISTRIBUTION

The following formula:

z = (X − μ) / σ

will convert a normal table into a standard normal table.

Page 45: Introduction to Statistics - Part 2

Exercise

If the average IQ in a given population is 100, and the standard deviation is 15, what percentage of the population has an IQ of 145 or higher?

Page 46: Introduction to Statistics - Part 2

Answer

P(X ≥ 145) = P(Z ≥ (145 − 100)/15) = P(Z ≥ 3)

From tables, 99.87% of the population lies below z = 3

=> 0.13% of the population
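The same answer can be computed directly from the standard normal distribution; a minimal sketch in Python (assuming SciPy is available):

    from scipy.stats import norm

    mu, sigma = 100, 15            # population mean and standard deviation
    z = (145 - mu) / sigma         # standardise: z = (X - mu) / sigma
    print(norm.sf(z))              # P(Z >= 3) ~ 0.00135, i.e. about 0.13%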

Page 47: Introduction to Statistics - Part 2

Trends in Statistical Tests used in Research Papers

Historically → Currently:

- Results in: Accept/Reject
- Results in: p-Value
- Results in: Approx. Mean

Page 48: Introduction to Statistics - Part 2

Confidence Intervals 

A confidence interval is used to express the uncertainty in a quantity being estimated. There is uncertainty because inferences are based on a random sample of finite size from a population or process of interest. To judge the statistical procedure we can ask what would happen if we were to repeat the same study, over and over, getting different data (and thus different confidence intervals) each time.
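This repeated-sampling interpretation is easy to simulate; a minimal sketch in Python (assuming NumPy, with an arbitrary population and sample size chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 50.0, 10.0, 25, 10_000
    covered = 0
    for _ in range(trials):
        sample = rng.normal(mu, sigma, n)
        se = sigma / np.sqrt(n)            # standard error (sigma known)
        lo = sample.mean() - 1.96 * se
        hi = sample.mean() + 1.96 * se
        covered += (lo <= mu <= hi)        # did this interval cover mu?
    print(covered / trials)                # ~0.95: about 95% of intervals do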

Page 49: Introduction to Statistics - Part 2

Confidence Intervals 

Page 50: Introduction to Statistics - Part 2

Jerzy Neyman

Born April 16, 1894; died August 5, 1981; born in Bessarabia, Imperial Russia. A statistician who spent most of his professional career at the University of California, Berkeley. Developed modern scientific sampling (random samples) in 1934, the Neyman–Pearson lemma in 1933, and the confidence interval in 1937.

Page 51: Introduction to Statistics - Part 2

Egon Pearson

Born 11 August 1895; died 12 June 1980; born in Hampstead, London. Son of Karl Pearson, and a leading British statistician. Developed the Neyman–Pearson lemma (with Neyman) in 1933.

Page 52: Introduction to Statistics - Part 2

Neyman and Pearson's joint work formally started in the spring of 1927. From 1928 to 1934, they published several important papers on the theory of testing statistical hypotheses. In developing their theory, Neyman and Pearson recognized the need to include alternative hypotheses, and they perceived the errors in testing hypotheses concerning unknown population values based on sample observations that are subject to variation. They called the error of rejecting a true hypothesis the first kind of error, and the error of accepting a false hypothesis the second kind of error. They called a hypothesis that completely specifies a probability distribution a simple hypothesis; a hypothesis that is not simple is a composite hypothesis. Their joint work led to Neyman developing the idea of confidence interval estimation, published in 1937.

Page 53: Introduction to Statistics - Part 2

Confidence Intervals 

Neyman, J. (1937) "Outline of a theory of statistical estimation based on the classical theory of probability" Philos. Trans. Roy. Soc. London. Ser. A. , Vol. 236 pp. 333–380.

Page 54: Introduction to Statistics - Part 2

Confidence Intervals 

If we know the true population mean and we sample n individuals, then, if the data are normally distributed, the mean of these n samples has a 95% chance of falling into the interval μ ± 1.96 × SE,

Page 55: Introduction to Statistics - Part 2

Confidence Intervals 

where the standard error for a 95% CI may be calculated as follows:

SE = σ / √n

Page 56: Introduction to Statistics - Part 2

Example 1

Page 57: Introduction to Statistics - Part 2

Example 1

Did FF have more of the popular vote than FG-L? In a random sample of 721 respondents:

382 FF, 339 FG-L

Can we conclude that FF had more than 50% of the popular vote?

Page 58: Introduction to Statistics - Part 2

Example 1 - Solution

Sample proportion p = 382/721 = 0.53
Sample size n = 721
Standard error = √(p(1 − p)/n) = 0.02

95% confidence interval: 0.53 ± 1.96 × 0.02 = 0.53 ± 0.04 → [0.49, 0.57]

Thus, we cannot conclude that FF had more of the popular vote, since this interval spans 50%. So we say: "the data are consistent with the hypothesis that there is no difference".
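A minimal sketch of this calculation in Python (standard library only):

    import math

    successes, n = 382, 721
    p = successes / n                        # sample proportion
    se = math.sqrt(p * (1 - p) / n)          # standard error of a proportion
    lo, hi = p - 1.96 * se, p + 1.96 * se    # 95% confidence interval
    print(round(lo, 2), round(hi, 2))        # [0.49, 0.57] -- spans 0.50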

Page 59: Introduction to Statistics - Part 2

Example 2

Page 60: Introduction to Statistics - Part 2

Example 2

Did Obama have more of the popular vote than McCain? In a random sample of 1000 respondents:

532 Obama, 468 McCain

Can we conclude that Obama had more than 50% of the popular vote?

Page 61: Introduction to Statistics - Part 2

Example 2 – 95% CI

Sample proportion p = 532/1000 = 0.532
Sample size n = 1000
Standard error = √(p(1 − p)/n) = 0.016

95% confidence interval: 0.532 ± 1.96 × 0.016 = 0.532 ± 0.031 → [0.501, 0.563]

Thus, we can conclude that Obama had more of the popular vote, since this interval does not span 50%. So we say: "the data are consistent with the hypothesis that there is a difference at the 95% confidence level".

Page 62: Introduction to Statistics - Part 2

Example 2 – 99% CI

Sample proportion p = 532/1000 = 0.532
Sample size n = 1000
Standard error = √(p(1 − p)/n) = 0.016

99% confidence interval: 0.532 ± 2.58 × 0.016 = 0.532 ± 0.041 → [0.491, 0.573]

Thus, we cannot conclude that Obama had more of the popular vote, since this interval spans 50%. So we say: "the data are consistent with the hypothesis that there is no difference at the 99% confidence level".

Page 63: Introduction to Statistics - Part 2

Example 2 – 99.99% CI

Sample proportion p = 532/1000 = 0.532
Sample size n = 1000
Standard error = √(p(1 − p)/n) = 0.016

99.99% confidence interval: 0.532 ± 3.87 × 0.016 = 0.532 ± 0.06 → [0.472, 0.592]

Thus, we cannot conclude that Obama had more of the popular vote, since this interval spans 50%. So we say: "the data are consistent with the hypothesis that there is no difference at the 99.99% confidence level".
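The effect of raising the confidence level can be reproduced by varying the z multiplier; a minimal sketch in Python (assuming SciPy for the z-values):

    import math
    from scipy.stats import norm

    p, n = 0.532, 1000
    se = math.sqrt(p * (1 - p) / n)           # ~0.016
    for level in (0.95, 0.99, 0.9999):
        z = norm.ppf(1 - (1 - level) / 2)     # two-sided z multiplier
        print(level, round(p - z * se, 3), round(p + z * se, 3))
    # the interval widens with the level, and eventually spans 0.50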

Page 64: Introduction to Statistics - Part 2

T-Tests

Page 65: Introduction to Statistics - Part 2

William Sealy Gosset

Born June 13, 1876; died October 16, 1937; born in Canterbury, England. On graduating from Oxford in 1899, he joined the Dublin brewery of Arthur Guinness & Son. Published a significant paper in 1908 concerning the t-distribution.

Page 66: Introduction to Statistics - Part 2

Gosset acquired his statistical knowledge by study, and he also spent two terms in 1906–1907 in the biometric laboratory of Karl Pearson. Gosset applied his knowledge for Guinness both in the brewery and on the farm: to the selection of the best-yielding varieties of barley, and to comparing the different brewing processes for changing raw materials into beer. Gosset and Pearson had a good relationship, and Pearson helped Gosset with the mathematics of his papers, including the 1908 paper, but he had little appreciation of their importance. The papers addressed the brewer's concern with small samples, while the biometrician typically had hundreds of observations and saw no urgency in developing small-sample methods.

Page 67: Introduction to Statistics - Part 2

T-Tests

Student (1908), "The Probable Error of a Mean", Biometrika, Vol. 6, No. 1, pp. 1–25.

Page 68: Introduction to Statistics - Part 2

T-Tests

Guinness did not allow its employees to publish results, but management decided to allow Gosset to publish his work under a pseudonym, "Student". Hence we have Student's t-test.

Page 69: Introduction to Statistics - Part 2

T-Tests

- A powerful parametric test for calculating the significance of a small sample mean
- Necessary for small samples because their sampling distributions are not normal
- One first has to calculate the "degrees of freedom"

Page 70: Introduction to Statistics - Part 2
Page 71: Introduction to Statistics - Part 2

 

~ THE GOLDEN RULE ~

Use the t-Test when your sample size is less than 30

Page 72: Introduction to Statistics - Part 2

T-Tests

The t-test is appropriate:

- If the underlying population is normal
- If the underlying population is not skewed and is reasonably close to normal (n < 15)
- If the underlying population is skewed and there are no major outliers (n > 15)
- If the underlying population is skewed and there are some outliers (n > 24)

Page 73: Introduction to Statistics - Part 2

T-Tests

Form of Confidence Interval with t-Value:

Mean ± t-Value × SE

where the mean and the standard error are calculated as before.
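A minimal sketch of this interval in Python (assuming SciPy, with made-up sample data):

    import numpy as np
    from scipy import stats

    sample = np.array([6.1, 7.4, 5.9, 8.2, 6.8, 7.0, 6.5])  # hypothetical data
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error, as before
    t = stats.t.ppf(0.975, df=len(sample) - 1)       # t-value for a 95% CI, df = n - 1
    print(mean - t * se, mean + t * se)              # the 95% confidence interval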

Page 74: Introduction to Statistics - Part 2

Two Sample T-Test: Unpaired Sample

Consider a questionnaire on computer use given to final-year undergraduates in 2007, and the same questionnaire given to undergraduates in 2008. As there is no direct one-to-one correspondence between individual students (in fact, there may be a different number of students in each class), you have to sum up all the responses of a given year, obtain an average from that, do the same for the following year, and compare the averages.

Page 75: Introduction to Statistics - Part 2

Two Sample T-Test: Paired Sample

If you are doing a questionnaire that is testing the BEFORE/AFTER effect of a parameter on the same population, then we can individually calculate the difference between each pair of samples and then average the differences. The paired test is a much stronger (more powerful) statistical test.
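A minimal sketch contrasting the two tests in Python (assuming SciPy, with made-up before/after scores):

    import numpy as np
    from scipy import stats

    before = np.array([6.2, 5.9, 7.1, 6.8, 6.0, 6.5])   # hypothetical data
    after  = np.array([9.0, 8.7, 9.8, 9.4, 8.9, 9.3])

    # Unpaired: treats the two groups as independent samples
    print(stats.ttest_ind(before, after))

    # Paired: tests the per-individual differences (more powerful here)
    print(stats.ttest_rel(before, after))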

Page 76: Introduction to Statistics - Part 2

Choosing the right test

Page 77: Introduction to Statistics - Part 2

Choosing a statistical test

http://www.graphpad.com/www/Book/Choose.htm
