Data Collection and Sampling

ST 511

Upload: della-holt

Post on 19-Jan-2018



Methods of Collecting Data

• The reliability and accuracy of the data affect the validity of the results of a statistical analysis.
• The reliability and accuracy of the data depend on the method of collection.
• Three of the most popular sources of statistical data are:
  – Published data
  – Observational studies
  – Experimental studies


Published Data

– Published data is often a preferred source of data due to low cost and convenience.
– Published data is found as printed material, tapes, disks, and on the Internet.
– Data published by the organization that has collected it is called PRIMARY DATA.
  For example: data published by the US Bureau of Census.
– Data published by an organization different from the organization that has collected it is called SECONDARY DATA.
  For example:
    • The Statistical Abstract of the United States compiles data from primary sources.
    • Compustat sells a variety of financial data tapes compiled from primary sources.


Observational and Experimental Studies

• When published data is unavailable, one needs to conduct a study to generate the data.
  – An observational study is one in which measurements representing a variable of interest are observed and recorded, without controlling any factor that might influence their values.
  – An experimental study is one in which measurements representing a variable of interest are observed and recorded, while controlling factors that might influence their values.


Surveys

• Surveys solicit information from people.
• Surveys can be made by means of
  – personal interview
  – telephone interview
  – self-administered questionnaire


Surveys

A good questionnaire must be well designed:
• Keep the questionnaire as short as possible.
• Ask short, simple, and clearly worded questions.
• Start with demographic questions to help respondents get started comfortably.
• Use dichotomous and multiple-choice questions.
• Use open-ended questions cautiously.
• Avoid using leading questions.
• Pretest a questionnaire on a small number of people.
• Think about the way you intend to use the collected data when preparing the questionnaire.


Sampling and Sampling Plans

• Motivation for conducting a sampling procedure:
  – Costs.
  – Population size.
  – The possible destructive nature of the sampling process.
• The sampled population and the target population should be similar to one another.


Sampling Plans

• We introduce four different sampling plans:
  – Simple random samples
  – Stratified random samples
  – Cluster samples
  – Systematic samples


Simple Random Samples

• In simple random sampling all the samples with the same size are equally likely to be chosen.
  – It is a consequence of this definition that each individual in the population has an equal chance to be chosen.
• An SRS is the standard against which we measure other sampling methods, and the sampling method on which the theory of working with sampled data is based.
• To conduct random sampling…
  – assign a number to each element of the chosen population (or use already given numbers),
  – randomly select the sample numbers (members). Use a random numbers table, or a software package.
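The two steps for conducting random sampling can be sketched with a software package, here Python's standard `random` module. This is only an illustration: the population of 1,000 numbered elements, the sample size of 40, the seed, and the helper name `simple_random_sample` are assumptions, not part of the slides.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed is used only to make the sketch repeatable
    return rng.sample(list(population), n)


# Step 1: assign a number to each element (hypothetical population of 1,000).
element_ids = range(1, 1001)

# Step 2: randomly select the sample numbers.
sample = simple_random_sample(element_ids, 40, seed=511)
print(sorted(sample))
```

`random.sample` draws without replacement, so no element can appear twice in the sample.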


Simple Random Samples (cont.)

• To select a sample at random, we first need to define where the sample will come from.
  – The sampling frame is a list of individuals from which the sample is drawn.
  – E.g., to select a random sample of students from a college, we might obtain a list of all registered full-time students.
  – When defining the sampling frame, we must deal with the details that define the population: are part-time students included? How about current study-abroad students?
• Once we have our sampling frame, the easiest way to choose an SRS is with random numbers.


Simple Random Sampling

• Example
  – A government income-tax auditor is responsible for 1,000 tax returns.
  – The auditor will randomly select 40 returns to audit.
  – Use Excel’s random number generator to select the returns.
• Solution
  – We generate 50 numbers between 1 and 1000 (we need only 40 numbers, but the extras might be used if duplicate numbers are generated).


Simple Random Sampling

Uniform on (0, 1)    × 1000       Rounded up
0.3820002            382.00018    383
0.1006806            100.68056    101
0.5964843            596.48427    597
0.8991058            899.10581    900
0.8846095            884.60952    885
0.9584643            958.46431    959
0.0144963            14.496292    15
0.4074221            407.4221     408
0.8632466            863.24656    864
0.1385846            138.58455    139
0.2450331            245.03311    246
. . .                . . .        . . .

• First column: 50 numbers uniformly distributed between 0 and 1.
• Second column: the same 50 numbers multiplied by 1000, giving random numbers between 0 and 1000.
• Third column: 50 random, uniformly distributed whole numbers between 1 and 1000; each has a probability of 1/1000 of being selected.

The auditor should select 40 files numbered 383, 101, ...
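The auditor example can also be sketched in Python instead of Excel. This is a rough equivalent under stated assumptions, not the spreadsheet used on the slide: the function names `to_return_number` and `draw_return_numbers`, the seed, and the choice to raise an error when too many duplicates occur are all illustrative.

```python
import math
import random


def to_return_number(u, n_returns=1000):
    """Map a uniform number in [0, 1) to a whole number in 1..n_returns by rounding up."""
    # max(1, ...) guards the rare case u == 0.0, where rounding up would give 0.
    return max(1, math.ceil(u * n_returns))


def draw_return_numbers(n_needed=40, n_draws=50, n_returns=1000, seed=None):
    """Generate n_draws numbers, drop duplicates, and keep the first n_needed."""
    rng = random.Random(seed)
    selected, seen = [], set()
    for _ in range(n_draws):
        number = to_return_number(rng.random(), n_returns)
        if number not in seen:  # the extra draws cover for duplicates
            seen.add(number)
            selected.append(number)
    if len(selected) < n_needed:
        raise ValueError("too many duplicates; generate more numbers")
    return selected[:n_needed]


selected = draw_return_numbers(seed=511)
print(selected)
```

Applied to the first value in the table, `to_return_number(0.3820002)` gives 383, matching the rounded-up column.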


Stratified Random Sampling

• This sampling procedure separates the population into mutually exclusive sets (strata), and then selects simple random samples from each stratum.

Sex
• Male
• Female

Age
• under 20
• 20-30
• 31-40
• 41-50

Occupation
• professional
• clerical
• blue-collar


Stratified Random Sampling

• With this procedure we can acquire information about
  – the whole population
  – each stratum
  – the relationships among strata.


Stratified Random Sampling

• There are several ways to build the stratified sample. For example, keep the proportion of each stratum in the population.

A sample of size 1,000 is to be drawn.

Stratum   Income             Population proportion   Stratum size
1         under $15,000      25%                     250
2         $15,000-29,999     40%                     400
3         $30,000-50,000     30%                     300
4         over $50,000       5%                      50
Total                                                1,000
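The stratum sizes in the table follow from multiplying the sample size by each stratum's population proportion. A minimal sketch of that arithmetic, assuming a hypothetical helper `proportional_allocation` and percentages entered as whole numbers:

```python
def proportional_allocation(percentages, sample_size):
    """Allocate the sample to strata in proportion to their population share."""
    # Note: with other inputs the rounded sizes may not add up exactly to sample_size.
    return {
        stratum: round(sample_size * pct / 100)
        for stratum, pct in percentages.items()
    }


# Population proportions from the table, in percent.
income_strata = {
    "under $15,000": 25,
    "$15,000-29,999": 40,
    "$30,000-50,000": 30,
    "over $50,000": 5,
}

sizes = proportional_allocation(income_strata, 1000)
print(sizes)
print(sum(sizes.values()))
```

A simple random sample of the allocated size would then be drawn separately within each stratum.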


Cluster Sampling

• Cluster sampling is a simple random sample of groups or clusters of elements.
• This procedure is useful when
  – it is difficult and costly to develop a complete list of the population members (making it difficult to develop a simple random sampling procedure);
  – the population members are widely dispersed geographically.
• Cluster sampling may increase sampling error, because of probable similarities among cluster members.
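As a rough illustration of cluster sampling, the sketch below draws a simple random sample of clusters and then keeps every member of each chosen cluster (a one-stage design, which is an assumption; the slides do not specify the design). The city blocks, household labels, seed, and the function name `cluster_sample` are all hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters clusters by simple random sampling; keep all their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # SRS over cluster names
    members = [member for name in chosen for member in clusters[name]]
    return chosen, members


# Hypothetical population: 20 city blocks with 5 households each.
clusters = {
    f"block_{b:02d}": [f"block_{b:02d}_house_{h}" for h in range(1, 6)]
    for b in range(1, 21)
}

chosen, households = cluster_sample(clusters, 4, seed=511)
print(chosen)
print(len(households))
```

Only a list of clusters is needed up front, not a complete list of population members, which is the practical appeal described above.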


Systematic Samples

• Sometimes we draw a sample by selecting individuals systematically.
  – For example, you might survey every 10th person on an alphabetical list of students.
• To make it random, you must still start the systematic selection from a randomly selected individual.
• When there is no reason to believe that the order of the list could be associated in any way with the responses sought, systematic sampling can give a representative sample.
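The "every 10th person with a random start" idea can be sketched as follows. The list of 200 student labels, the seed, and the function name `systematic_sample` are assumptions made for illustration.

```python
import random


def systematic_sample(frame, step, seed=None):
    """Take every step-th individual, starting from a randomly chosen position."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start among the first `step` individuals
    return frame[start::step]


# Hypothetical alphabetical list of 200 students.
students = [f"student_{i:03d}" for i in range(1, 201)]

sample = systematic_sample(students, 10, seed=511)
print(sample)
```

With 200 students and a step of 10, any starting position yields 20 students spaced 10 apart on the list.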


Systematic Samples (cont.)

• Systematic sampling can be much less expensive than true random sampling.

• When you use a systematic sample, you need to justify the assumption that the systematic method is not associated with any of the measured variables.


What Can Go Wrong? Or, How to Sample Badly

• Sample Badly with Volunteers:
  – In a voluntary response sample, a large group of individuals is invited to respond, and all who do respond are counted.
    • Voluntary response samples are almost always biased, and so conclusions drawn from them are almost always wrong.
  – Voluntary response samples are often biased toward those with strong opinions or those who are strongly motivated.
  – Since the sample is not representative, the resulting voluntary response bias invalidates the survey.


What Can Go Wrong? Or, How to Sample Badly (cont.)

• Sample Badly, but Conveniently:
  – In convenience sampling, we simply include the individuals who are convenient.
    • Unfortunately, this group may not be representative of the population.
  – Convenience sampling is not only a problem for students or other beginning samplers.
    • In fact, it is a widespread problem in the business world: the easiest people for a company to sample are its own customers.


What Can Go Wrong? Or, How to Sample Badly (cont.)

• Sample from a Bad Sampling Frame:
  – An SRS from an incomplete sampling frame introduces bias because the individuals included may differ from the ones not in the frame.
• Undercoverage:
  – Many of these bad survey designs suffer from undercoverage, in which some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the population.
  – Undercoverage can arise for a number of reasons, but it’s always a potential source of bias.


What Else Can Go Wrong?

• Watch out for nonrespondents.
  – A common and serious potential source of bias for most surveys is nonresponse bias.
  – No survey succeeds in getting responses from everyone.
    • The problem is that those who don’t respond may differ from those who do.
    • And they may differ on just the variables we care about.


What Else Can Go Wrong? (cont.)

• Don’t bore respondents with surveys that go on and on and on and on…
  – Surveys that are too long are more likely to be refused, reducing the response rate and biasing all the results.


What Else Can Go Wrong? (cont.)

• Work hard to avoid influencing responses.
  – Response bias refers to anything in the survey design that influences the responses.
  – For example, the wording of a question can influence the responses: