GATHERING AND PRODUCING DATA

Post on 20-Dec-2015


Page 1: GATHERING AND PRODUCING DATA. How Data are Obtained Census –Everyone is included Observational Study –Observes individuals and measures variables but



How Data are Obtained

• Census
– Everyone is included
• Observational Study
– Observes individuals and measures variables but does not attempt to influence responses
– Includes surveys and polls
• Experiment
– Deliberately imposes some treatment on individuals in order to observe their responses
– In medicine, this is called a clinical trial


3 BIG ideas

1. Examine a part of the whole: take a sample from a population
2. Randomization helps ensure the sample is representative
3. The size of the sample is what’s important, not the size of the population


Big Idea #1: Examine Part of the Whole

• We are studying an entire population of individuals (or subjects), but looking at everyone is practically impossible.
– How many support the U.S. role in Iraq?
– What percent of the tomato shipment is bad?
– How many children are obese?
– What’s the price of gas at the pump across Minnesota?

• Settle for looking at a smaller group—a sample—selected from the population.

• Sampling is natural! Think about cooking. You taste (sample) a small part to get an idea about the dish as a whole.


Populations and parameters, samples and statistics (This stuff is important!)

• A parameter is a numerical quantity that describes a population.
• A statistic is a numerical quantity that describes the sample.
• We study a population by looking at a sample. We infer about a parameter by using statistics from the sample.
• Notation: use Greek letters for parameters and Latin letters for statistics


Example: Polling

Minneapolis Star Tribune: “A Gallup Poll, conducted Aug. 16-18, 1999, asked, ‘Do you consider pro-wrestling to be a sport, or not?’ Of the people polled, 19% said, “Yes.” (Results were based on telephone interviews with a randomly selected national sample of 1,028 adults, 18 years and older.)”

What’s the population, parameter, sample, statistic?
• Population: Americans, 18 years and older
• Sample: The 1,028 people who were polled
• Parameter: The proportion of American adults who believe pro-wrestling is a sport (called the population proportion). p = ?
• Statistic: The proportion of people in the sample who said they believe pro-wrestling is a sport (called the sample proportion). p̂ = 0.19
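The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population below is made up, and its true proportion p = 0.20 is an assumed value (in the real poll p is unknown). Only the sample size of 1,028 comes from the example.

```python
import random

def sample_proportion(population, n, rng):
    """Draw a simple random sample of size n and return the share of 1s (p-hat)."""
    sample = rng.sample(population, n)
    return sum(sample) / n

# Hypothetical population of 100,000 adults; 1 means "yes, it is a sport".
# The true proportion 0.20 is an assumption made for illustration only.
population = [1] * 20_000 + [0] * 80_000
p = sum(population) / len(population)   # the parameter (normally unknown)

rng = random.Random(1999)               # fixed seed so the run is repeatable
p_hat = sample_proportion(population, 1028, rng)   # the statistic

print(f"parameter p     = {p:.3f}")
print(f"statistic p-hat = {p_hat:.3f}")
```

Rerunning with a different seed changes p-hat but never p, which is the point of the two names.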


Example: Surveying a lot shipment

A carload of ball bearings has an average diameter of 2.502 centimeters. This is within the specifications for acceptance of the lot by the purchaser. An inspector happens to inspect 100 bearings from the lot and finds the average diameter of these to be 2.499 cm. This is within the specified limits, so the entire lot is accepted.

What’s the population, parameter, sample, statistic?
• Population: The carload of ball bearings
• Sample: The 100 ball bearings that were inspected
• Parameter: The average diameter of the ball bearings in the carload (the population mean). µ = 2.502 cm
• Statistic: The average diameter of the 100 ball bearings in the sample (the sample mean). ȳ = 2.499 cm


Big Idea #2: Randomization

• Randomization makes sure that, on average, the sample looks like the rest of the population.

• Randomization makes it possible to use quantitative tools (probability) to draw inferences about the population when we see only a sample.

• Randomization protects against bias.


“Who will you vote for in 2008?” Some examples of biased samples

• 100 people at the Mall of America
• 100 people in front of the Metrodome after a Twins game
• 100 friends, family and relatives
• 100 people who volunteered to answer a survey question on your web site
• 100 people who answered their phone during supper time
• The first 100 people you see after you wake up in the morning


Bias – the bane of sampling

• Samples that systematically misrepresent individuals in the population are said to be biased.

• Bias is the systematic failure of a sample to represent its population

• There is usually no way to fix a biased sample and no way to salvage useful information from it.

• The best way to avoid bias is to select individuals for the sample at random. The value of deliberately introducing randomness is one of the great insights of Statistics.


Simple Random Sample (SRS)

• Suppose we want to draw a sample of size n from some population

• For a simple random sample, every possible subset of size n has an equal chance to be selected and to become the sample.
– Such samples guarantee that each individual has an equal chance of being selected.
– Each combination of people also has an equal chance of being selected.
• The sampling frame is a list of the population from which the sample is drawn. From the sampling frame, we can choose a SRS using random numbers.
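Drawing a SRS from a sampling frame can be sketched as follows. The frame of 500 student labels is hypothetical; `random.sample` from Python's standard library draws without replacement, so every subset of size n is equally likely.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Return n distinct individuals from the frame, every subset equally likely."""
    if n > len(frame):
        raise ValueError("sample size cannot exceed the size of the sampling frame")
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: a list of 500 student labels.
frame = [f"student_{i:03d}" for i in range(1, 501)]

srs = simple_random_sample(frame, 10, seed=42)
print(srs)
```

Passing a seed only makes the draw repeatable for checking; in practice the seed would not be chosen by hand.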


SRS and Sampling Variability

• Samples drawn at random generally differ from one another.
• These differences lead to different values for the variables we measure.
• Sample-to-sample differences are called sampling variability.
• This is different from bias!
• Example: Everyone pick 10 Skittles at random from “The Bowl” and count how many reds.
– The variability of the different sample counts is sampling variability.
– If half the class peeked and tried to get more reds the differences would reflect bias.
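The Skittles exercise can be imitated with a short simulation. The bowl composition (100 red out of 500) and the class size of 30 are assumptions for illustration; the sample size of 10 comes from the example.

```python
import random
from collections import Counter

def count_reds(bowl, k, rng):
    """Pick k candies at random (no peeking) and count the red ones."""
    return rng.sample(bowl, k).count("red")

# Hypothetical bowl: 500 candies, 20% of them red.
bowl = ["red"] * 100 + ["other"] * 400

rng = random.Random(7)
# Thirty students each draw 10 candies from the same bowl.
counts = [count_reds(bowl, 10, rng) for _ in range(30)]

# Tally how many students saw 0 reds, 1 red, 2 reds, ...
print(sorted(Counter(counts).items()))
```

Every student samples the same bowl the same way, so the spread in the tally is sampling variability only, not bias.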


Sources of sampling error

• In the context of using a sample to estimate a population parameter, sampling variability is sometimes called “sampling error.”

• Taking a SRS of 3 students to estimate the average height of all students will have a large sampling error, but it is not biased.

• Taking a sample of 300 basketball players to estimate the average height of all students will produce less variability but the sample is biased.


More complex sampling designs

• Simple random sampling is not the only way to sample.

• More complicated designs may save time or money or help avoid sampling problems.
– Stratified sampling
– Cluster sampling
– Systematic sampling
– Multi-stage sampling

• All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.


Stratified sampling

• Suppose we want a sample of 240 Carleton students
• We also want to ensure discipline representation
• The student body divides as
– Arts and Literature 20%
– Humanities 15%
– Social Sciences 30%
– Mathematics and Natural Sciences 35%
• For the sample, select
– 240 x .20 = 48 Arts and Literature students
– 240 x .15 = 36 Humanities students
– 240 x .30 = 72 Social Sciences students
– 240 x .35 = 84 Mathematics and Natural Sciences students

• Within each discipline, choose a SRS
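The allocation arithmetic and the within-stratum SRS can be sketched together. The shares and the total of 240 come from the example; the per-discipline frames (a student body of 2,000) are hypothetical, as are the function names.

```python
import random

def proportional_allocation(total_n, shares):
    """Sample size per stratum = total sample size x stratum share.

    Rounding can make the pieces miss the total by one or two in general;
    with the shares used here they add up exactly.
    """
    return {name: round(total_n * share) for name, share in shares.items()}

def stratified_sample(frames, allocation, rng):
    """Take a simple random sample of the allocated size within each stratum."""
    return {name: rng.sample(frames[name], allocation[name]) for name in frames}

shares = {
    "Arts and Literature": 0.20,
    "Humanities": 0.15,
    "Social Sciences": 0.30,
    "Mathematics and Natural Sciences": 0.35,
}
allocation = proportional_allocation(240, shares)
print(allocation)

# Hypothetical frames: 2,000 students split according to the same shares.
frames = {
    name: [f"{name} #{i}" for i in range(round(2000 * share))]
    for name, share in shares.items()
}
sample = stratified_sample(frames, allocation, random.Random(240))
print({name: len(chosen) for name, chosen in sample.items()})
```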


Stratified Sampling

• The population is divided into homogeneous groups, called strata, before the sample is selected.
• Then simple random sampling is used within each stratum before the results are combined.
• Advantages
– Sample will be representative for the strata
– Reduces sampling variability
• Disadvantages
– May be logistically difficult if even possible to implement
– Must have information about the population

• Note: a stratified sample is not a SRS


Cluster sampling

• Sometimes stratifying isn’t practical and simple random sampling is difficult. Splitting the population into clusters can make sampling more practical.

• Suppose you want to do a face-to-face survey of attitudes in Minnesota based on a sample of size 600.

• Choosing 600 people at random, finding their addresses, and meeting them in person is costly and time-consuming.

• Another idea: Choose some cities at random. Then some streets at random, and then some blocks at random. Interview everyone on the selected blocks.

• The blocks are the clusters.
• If you know there are about 20 people per block, then choose a random sample of 30 blocks.
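A simplified sketch of the last step: pick 30 blocks at random and interview everyone on them. The 1,000 blocks and their resident labels are hypothetical, each block is given exactly 20 residents to match the example, and the earlier stages (cities, then streets) are left out to keep the sketch short.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Choose whole clusters at random, then include everyone in each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    people = [person for block in chosen for person in clusters[block]]
    return chosen, people

# Hypothetical map from block id to its residents, 20 residents per block.
blocks = {
    f"block_{b:04d}": [f"block_{b:04d}_resident_{r:02d}" for r in range(20)]
    for b in range(1000)
}

rng = random.Random(2006)
chosen_blocks, interviewees = cluster_sample(blocks, 30, rng)
print(len(chosen_blocks), "blocks,", len(interviewees), "people")  # 30 blocks, 600 people
```

Chance selects the blocks, not the individuals; within a chosen block nobody is skipped.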


Cluster sampling in the news: The Lancet study on Iraq casualties

• In October 2006, The Lancet published “Iraq mortality after the 2003 invasion: a cross-sectional cluster sample survey”

• The study was controversial because of its findings that hundreds of thousands of Iraqis (most likely about 650,000) had been killed since the U.S. invasion.

• Earlier reports, including those from the U.S. and British governments, had put the number at about 30,000.

• The study was based on cluster sampling, a common methodology in public health and human rights work

• The clusters were groups of 40 houses in close proximity whose locations were chosen based on population demographics.


Cluster Sampling

• If each cluster fairly represents the population, cluster sampling will give an unbiased sample.

• Advantage
– Easier to implement depending on context
• Disadvantage
– Greater sampling variability, so less statistical accuracy


Multistage Sampling

• Most surveys conducted by the government or professional polling organizations use some combination of stratified and cluster sampling as well as simple random sampling.
• The Current Population Survey is how the government estimates the unemployment rate.
• Counties are divided into 2,007 Primary Sampling Units (PSUs).
• PSUs are divided into smaller census blocks, and the blocks are grouped into strata. Households in each block are grouped into clusters of about 4 households each.
• The final sample consists of these clusters, and interviewers go to all households in the chosen clusters.


Systematic Samples

• Sometimes we draw a sample by selecting individuals systematically.
– For example, you might survey every 10th person on an alphabetical list of students.
• To make it random, you must still start the systematic selection from a randomly selected individual.
• When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample.

• Systematic sampling can be much less expensive than true random sampling.
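A minimal sketch of a systematic sample with a random start. The alphabetical list of 200 student labels is hypothetical; the step of 10 follows the "every 10th person" example.

```python
import random

def systematic_sample(frame, step, rng):
    """Pick a random start among the first `step` entries, then take every step-th one."""
    start = rng.randrange(step)
    return frame[start::step]

# Hypothetical alphabetical list of 200 students.
students = [f"student_{i:03d}" for i in range(200)]

chosen = systematic_sample(students, 10, random.Random(10))
print(len(chosen), "students selected, starting with", chosen[0])
```

The random start is the only chance element; once it is fixed, the rest of the sample is determined by the order of the list.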


Sampling Example

Hospital administrators are concerned about the possibility of drug abuse among employees. They plan to pick a sample of 40 from 800 employees, and administer a drug test. What’s the sampling strategy?
• Randomly select 10 doctors, 10 nurses, 10 office staff, and 10 support staff for the test.
• Each employee has a 4-digit ID number. Randomly choose 40 numbers.
• At the start of each shift, choose every 20th person who arrives for work.
• There are 40 departments of 20 employees each. Randomly choose two departments (say radiology and ER) and test all the people who work in those departments.


Big Idea #3: Sample size is key, not population size

• How large a sample size do we need for the sample to be reasonably representative of the population?

• In general, it’s the size of the sample, not the size of the population, that makes the difference in sampling.

• The fraction of the population that you’ve sampled doesn’t matter. It’s the sample size itself that’s important

• Back to cooking: If the soup is mixed enough a tablespoon will suffice, whether you’re “sampling” from a saucepan or from a barrel.


How big a sample?

• Most professional polls choose a sample size of about 1,000 people.

• These polls report a “margin of error” of about 3%. That means that with “high confidence” their estimates are within 3 percentage points of the true population parameter value.

• The margin of error for a sample of 1,000 people is the same for Minneapolis (pop. 400,000), Minnesota (pop. 5 million), and the U.S. (pop. 290 million)

• But the bad news is that if you want similar accuracy at Carleton, you need to poll over half the student body.

• Coming Attractions: Margin of Error = 1/√n, and 0.03 ≈ 1/√1000. But you’ll have to wait until we get to Statistical Inference to learn why.
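The rule of thumb, and the claim that population size barely matters, can be checked numerically. The sketch below uses the three population sizes quoted above. The `standard_error` function uses the standard formula for a proportion from a SRS drawn without replacement, with p = 0.5 as an assumed worst case; at p = 0.5 the quantity 1/√n is exactly two of these standard errors when the population is very large. The function names are made up for this sketch.

```python
import math

def rule_of_thumb_margin(n):
    """Approximate margin of error for a sample of size n: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

def sample_size_for_margin(margin):
    """Smallest n whose rule-of-thumb margin is at most `margin`."""
    return math.ceil(1 / margin ** 2)

def standard_error(n, population_size, p=0.5):
    """Standard error of a sample proportion for a SRS drawn without replacement.

    The second factor is the finite population correction; it is close to 1
    whenever the sample is a small fraction of the population.
    """
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

print(f"margin for n = 1000: {rule_of_thumb_margin(1000):.3f}")
print(f"n needed for a 3% margin: {sample_size_for_margin(0.03)}")

populations = [("Minneapolis", 400_000), ("Minnesota", 5_000_000), ("U.S.", 290_000_000)]
for name, size in populations:
    print(f"{name:12s} standard error with n = 1000: {standard_error(1000, size):.4f}")
```

The correction factor only becomes noticeable when the sample is a sizeable share of the population, which is why a small campus behaves differently from a city or a country.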


How to Sample Badly

• Advice columnist Ann Landers once asked parents: “If you had it to do over again, would you have children?”
• Do you think responses were representative of public opinion?
• Over 100,000 people responded, and 70% answered “No”!
• A later survey, more carefully designed, showed 90% of parents are happy with their decision to have children.
• In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted. But such samples are almost always biased toward those with strong opinions or those who are strongly motivated.

• Since the sample is not representative, the resulting voluntary response bias invalidates the survey.


What Can Go Wrong? Or, How to Sample Badly

• In convenience sampling, we simply include the individuals who are convenient. But they may not be representative of the population.
– A psychology professor performs an experiment using his classroom.
– A company samples opinions by using its own customers.
– Sampling mice from a large cage to study how a drug affects physical activity: The lab assistant reaches into the cage to select the mice one at a time until 10 are chosen. But which mice will likely be chosen?


Other problems

• Under-coverage
– In some survey designs a portion of the population is not sampled or has a smaller representation in the sample than it has in the population.
– Example: using telephone directories for a phone survey.
  • Half the households in large cities are unlisted.
  • About 5% of households don’t have phones.
– Random digit dialing only partially addresses this problem.
  • It misses students in dorms, inmates in prison, soldiers in the military, and homeless people. And it’s too expensive to call Hawaii or Alaska.
• Non-response
– No survey succeeds in getting responses from everyone.
  • The problem is that those who don’t respond may differ from those who do.
– The Bureau of Labor Statistics gets a 6-7% non-response rate.
– But it’s common for opinion polls and market research studies to have a 75-80% non-response rate.


What Else Can Go Wrong?

• Response bias refers to anything in the survey design that influences the responses.

• In particular, the wording of a question can have a big impact on the responses:


Some classic statistical mistakes: The Literary Digest Poll

• 1936 presidential election: Franklin Delano Roosevelt vs. Alf Landon
• The Literary Digest had correctly called every presidential election since 1916
• Sample size: 2.4 million!
• They predicted Roosevelt would lose, with about 43% of the vote
• In fact it was a landslide for Roosevelt at 62%


Literary Digest poll

• Context
– Midst of the Great Depression
– 9 million unemployed; real income down 1/3
– Landon’s program: “Cut spending”
– Roosevelt’s program: “Balance people’s budgets before the government’s budget”
• How the polling was done
– Survey sent to 10 million people
– And 2.4 million responded (that’s huge!)


A huge sample, but The Literary Digest poll was biased

• The sampling frame was not representative of the electorate (selection bias)
– Based on magazine subscription lists, drivers’ registrations, country club memberships, phone numbers (when telephones were a luxury)
– Biased toward better off groups (who were more Republican)
• Voluntary response bias
– Main issue was the economy
– The anti-Roosevelt forces were angry, and had a higher response rate!
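How unequal response rates distort a result, no matter how many people reply, can be shown with simple arithmetic. The response rates below (15% for supporters, 35% for opponents) are invented for illustration and are not estimates for the 1936 poll; only the 62% true share comes from the slides.

```python
def share_among_respondents(true_share, rate_supporters, rate_opponents):
    """Expected share of supporters among those who reply.

    true_share: fraction of the surveyed group that supports the candidate.
    rate_supporters, rate_opponents: fraction of each side that mails back a ballot.
    """
    supporters = true_share * rate_supporters
    opponents = (1 - true_share) * rate_opponents
    return supporters / (supporters + opponents)

# Hypothetical response rates; only the 62% true share is taken from the slides.
equal_rates = share_among_respondents(0.62, 0.25, 0.25)
unequal_rates = share_among_respondents(0.62, 0.15, 0.35)

print(f"equal response rates:   {equal_rates:.1%}")
print(f"unequal response rates: {unequal_rates:.1%}")
```

The number of ballots mailed out never enters the calculation, so a bigger mailing leaves the distortion unchanged.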


Gallup's final pre-election predictions (Error = Gallup prediction minus election result, in percentage points)

Year  Sample size  Winner      Gallup prediction  Election result  Error
1936  ~50,000      Roosevelt   55.7%              62.5%            -6.8%
1940  ~50,000      Roosevelt   52.0%              55.0%            -3.0%
1944  ~50,000      Roosevelt   51.5%              53.8%            -2.3%
1948  ~50,000      Truman      44.5%              49.5%            -5.0%
1952  5,385        Eisenhower  51.0%              55.4%            -4.4%
1956  8,144        Eisenhower  59.5%              57.8%            +1.7%
1960  8,015        Kennedy     51.0%              50.1%            +0.9%
1964  6,625        Johnson     64.0%              61.3%            +2.7%
1968  4,414        Nixon       43.0%              43.5%            -0.5%
1972  3,689        Nixon       62.0%              61.8%            +0.2%
1976  3,439        Carter      48.0%              50.1%            -2.1%
1980  3,500        Reagan      47.0%              50.8%            -3.8%
1984  3,456        Reagan      59.0%              59.2%            -0.2%
1988  4,089        Bush        56.0%              53.9%            +2.1%
1992  2,019        Clinton     49.0%              43.3%            +5.7%
1996  2,417        Clinton     52.0%              50.1%            +1.9%
2000  3,129        Bush        48.0%              47.9%            +0.1%
2004  1,866        Bush        49.0%              51.0%            -2.0%
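The Error column is the Gallup prediction minus the election result. A short sketch that recomputes it from the table's own figures, and compares the era of roughly 50,000-person samples with the later, much smaller ones (the variable names are made up for this sketch):

```python
# (year, Gallup prediction %, election result %) copied from the table.
rows = [
    (1936, 55.7, 62.5), (1940, 52.0, 55.0), (1944, 51.5, 53.8),
    (1948, 44.5, 49.5), (1952, 51.0, 55.4), (1956, 59.5, 57.8),
    (1960, 51.0, 50.1), (1964, 64.0, 61.3), (1968, 43.0, 43.5),
    (1972, 62.0, 61.8), (1976, 48.0, 50.1), (1980, 47.0, 50.8),
    (1984, 59.0, 59.2), (1988, 56.0, 53.9), (1992, 49.0, 43.3),
    (1996, 52.0, 50.1), (2000, 48.0, 47.9), (2004, 49.0, 51.0),
]

# Error = prediction - result, rounded to one decimal like the table.
errors = {year: round(pred - result, 1) for year, pred, result in rows}

def mean_abs_error(years):
    """Average size of the error, ignoring its sign, over the given years."""
    return sum(abs(errors[y]) for y in years) / len(years)

early = [y for y in errors if y <= 1948]   # samples of about 50,000
later = [y for y in errors if y >= 1952]   # samples of a few thousand

for year, err in errors.items():
    print(f"{year}: {err:+.1f}")
print(f"mean |error| through 1948: {mean_abs_error(early):.2f}")
print(f"mean |error| from 1952 on: {mean_abs_error(later):.2f}")
```

Comparing the two averages puts a number on the point that the design of a sample matters more than its sheer size.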


The Year the Polls Elected Dewey

• 1948 Election: Harry Truman versus Thomas Dewey

• Every major poll (including Gallup) predicted Dewey would win by 5 percentage points


What went wrong?

• Pollsters chose their samples using quota sampling. Each interviewer was assigned a fixed quota of subjects in certain categories (race, sex, age).
• For instance, an interviewer in St. Louis was required to talk to 13 people:
– 6 living in the suburbs, 7 in the central city
– 7 men and 6 women; of the 7 men (similar for women):
  • 3 under 40 years old, 4 over 40; 1 black, 6 white.
• In each category, interviewers were free to choose.
• But this left room for human choice and inevitable bias.
• Republicans were easier to reach. They had telephones, permanent addresses, “nicer” neighborhoods.
• So interviewers ended up with too many Republicans.
• Quota sampling was abandoned for random sampling.


Do you believe the poll? What questions should you ask?

• Who carried out the survey?
• What is the population?
• How was the sample selected?
• How large was the sample?
• What was the response rate?
• How were subjects contacted?
• When was the survey conducted?
• What are the exact questions asked?


To summarize . . .

• We are often interested in a population and some parameter that describes the population.

• We select a sample from that population and use a statistic from the sample to estimate the unknown parameter

• To obtain a good estimate, the sample must be as representative of the population as possible. Randomization, on average, ensures a representative sample.

• Possible sources of error are sampling variability and bias.
– To reduce sampling variability, take a bigger sample
– To reduce bias, get a better sampling design

• It’s the sample size, not the population size, that matters