why sample the population (3)


Upload: kamini3112

Post on 28-Oct-2014


Introduction

The term population refers to any collection of individuals, of their attributes, or of results of operations that can be numerically specified. Thus, there may be a population of weights of individuals, heights of trees, prices of wheat, numbers of plants in a field, numbers of students in a university, etc. A population with a finite number of individuals or members is called a finite population. For instance, the ages of twenty boys in a class form a finite population. A population with an infinite number of members is known as an infinite population. The pressures at various points in the atmosphere form an infinite population. For any statistical investigation with a large population, complete enumeration (a census) of the population is impracticable; an example is estimating the average monthly income of the individuals in an entire country. Further, if the population is infinite, complete enumeration is impossible. Enumeration can also be ruled out because it destroys the units observed: to know the total amount of timber available in a forest, one cannot cut down the entire forest.

To overcome the difficulties of complete enumeration, a part or fraction of the population is selected; this part is called a sample, and the process of selecting it is called sampling. For example, only 20 students are selected from a university, or 10 plants are selected from a field. To determine a population characteristic, only the units in the sample are observed, instead of all the units in the population, and the parameters of the population are estimated from them. Sampling is therefore resorted to when it is impossible to enumerate all the units in the population, when enumeration is too costly in time and money, or when the uncertainty inherent in sampling is more than compensated by the possibility of errors in complete enumeration. The theory of sampling reasons from the particular (i.e. the sample) to the general (i.e. the population), and hence all results have to be expressed in terms of probability. To serve a useful purpose, sampling should be unbiased and representative.

The aim of the theory of sampling is to get as much information as possible, ideally the whole of the information about the population from which the sample has been drawn. In particular, given the form of the parent population, one would like to estimate the parameters of the population or specify the limits within which the population parameters are expected to lie with a specified degree of confidence.


Sampling?

Sampling is a process used in statistical analysis in which a predetermined number of observations is taken from a larger population. The methodology used to sample from a larger population depends on the type of analysis being performed; common methods include simple random sampling, systematic sampling and observational sampling.

The sample should be a representation of the general population.

When taking a sample from a larger population, it is important to consider how the sample will be drawn. To get a representative sample, the sample must be drawn randomly and encompass the entire population. For example, a lottery system could be used to determine the average age of students in a University by sampling 10% of the student body, taking an equal number of students from each faculty.

Why sample the population?

1. Expense. A census may not be cost effective.

2. Speed of Response. There may not be enough time to obtain more than a sample.

3. Accuracy. A carefully obtained sample may be more accurate than a census.

4. Destructive Sampling. In destructive testing of products a sample has to suffice.

5. The large (infinite) population. Sometimes a census is impossible.

Parameter and Statistic

A number that describes a population is called a parameter.

A number that describes a sample is a statistic.

If we take a sample and calculate a statistic, we often use that statistic to infer something about the population from which the sample was drawn.
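The distinction can be illustrated with a small Python sketch. The population of student ages is made-up data, and the sample size of 20 is only an assumption for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical finite population: ages of 1,000 students (made-up data).
population = [random.randint(17, 30) for _ in range(1000)]

# Parameter: a number computed from the whole population.
mu = statistics.mean(population)

# Statistic: the same quantity computed from a random sample of 20 units.
sample = random.sample(population, 20)
x_bar = statistics.mean(sample)

print(f"parameter (population mean): {mu:.2f}")
print(f"statistic (sample mean):     {x_bar:.2f}")
```

The statistic changes from sample to sample, while the parameter stays fixed for a given population.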


What is Sampling Distribution?

Sampling is defined as the process of selecting a number of observations (subjects) from all the observations (subjects) from a particular group or population. Sampling distribution is defined as the frequency distribution of the statistic for many samples.

In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given statistic based on a random sample. Sampling distributions are important in statistics because they provide a major simplification on the route to statistical inference. More specifically, they allow analytical considerations to be based on the sampling distribution of a statistic, rather than on the joint probability distribution of all the individual sample values

The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given size. The sampling distribution depends on the underlying distribution of the population, the statistic being considered, the sampling procedure employed and the sample size used.
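A sampling distribution can be approximated by simulation. The following Python sketch draws many random samples from a hypothetical skewed population and collects their means; the population, the sample sizes and the number of repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population: exponential values with mean 10.
population = [random.expovariate(1 / 10) for _ in range(10_000)]

def sample_means(pop, n, repeats):
    """Draw `repeats` simple random samples of size n and return their means."""
    return [statistics.mean(random.sample(pop, n)) for _ in range(repeats)]

means_n5 = sample_means(population, 5, 2000)
means_n50 = sample_means(population, 50, 2000)

print("population mean:", round(statistics.mean(population), 2))
for n, means in ((5, means_n5), (50, means_n50)):
    print(f"n={n}: mean of sample means = {statistics.mean(means):.2f}, "
          f"spread of sample means = {statistics.stdev(means):.2f}")
```

The collected means approximate the sampling distribution of the mean; their spread shrinks as the sample size grows, which shows how the distribution depends on n.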

Features of Sampling Distribution

The 4 features of sampling distribution include:

1) The statistic of interest (Proportion, SD, or Mean)

2) Random selection of sample

3) Size of the random sample (very important)

4) The characteristics of the population being sampled.

Important sampling distributions

Some important sampling distributions, which are commonly used, are:

Sampling distribution of the mean

Sampling distribution of the proportion

Student's 't' distribution

F distribution

Chi-square distribution


1. Sampling Distribution of the Mean

Suppose we draw all possible samples of size n from a population of size N. Suppose further that we compute a mean score for each sample. In this way, we create a sampling distribution of the mean.

We know the following. The mean of the population (μ) is equal to the mean of the sampling distribution (μx). And the standard error of the sampling distribution (σx) is determined by the standard deviation of the population (σ), the population size, and the sample size. These relationships are shown in the equations below:

μx = μ      and      σx = σ * sqrt( 1/n - 1/N )

Therefore, we can specify the sampling distribution of the mean whenever two conditions are met:

The population is normally distributed, or the sample size is sufficiently large.

The population standard deviation σ is known.

Note: When the population size is very large, the factor 1/N is approximately equal to zero, and the standard deviation formula reduces to: σx = σ / sqrt(n).
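The relationships μx = μ and σx = σ * sqrt( 1/n - 1/N ) can be checked by listing every possible sample from a tiny population. The Python sketch below uses a toy population of five values and assumes sampling without replacement; it also assumes that σ is computed with the divisor N - 1, which is the convention under which the formula holds exactly.

```python
from itertools import combinations
from math import sqrt, isclose
from statistics import mean, pstdev, stdev

population = [1, 2, 3, 4, 5]   # toy population, N = 5
N, n = len(population), 2

# Means of all C(5, 2) = 10 possible samples drawn without replacement.
all_means = [mean(s) for s in combinations(population, n)]

mu = mean(population)
sigma = stdev(population)                 # assumption: divisor N - 1
se_formula = sigma * sqrt(1 / n - 1 / N)  # formula from the text
se_enumerated = pstdev(all_means)         # spread of the full sampling distribution

print("mean of sample means:", mean(all_means), "population mean:", mu)
print("standard error (formula):    ", round(se_formula, 4))
print("standard error (enumeration):", round(se_enumerated, 4))
```

Both routes give the same standard error, and the mean of the sample means equals the population mean.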

2. Sampling Distribution of the Proportion

In a population of size N, suppose that the probability of the occurrence of an event (dubbed a "success") is P; and the probability of the event's non-occurrence (dubbed a "failure") is Q. From this population, suppose that we draw all possible samples of size n. And finally, within each sample, suppose that we determine the proportion of successes p and failures q. In this way, we create a sampling distribution of the proportion.

We find that the mean of the sampling distribution of the proportion (μp) is equal to the probability of success in the population (P). And the standard error of the sampling distribution (σp) is determined by the standard deviation of the population (σ), the population size, and the sample size. These relationships are shown in the equations below:

μp = P      and      σp = σ * sqrt( 1/n - 1/N ) = sqrt[ PQ/n - PQ/N ]

where σ = sqrt[ PQ ].

Note: When the population size is very large, the factor PQ/N is approximately equal to zero; and the standard deviation formula reduces to: σp = sqrt( PQ/n )
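As a minimal sketch, the standard error of the proportion can be wrapped in a small Python function. The function name `proportion_se` and the example values of P, n and N are hypothetical; passing no N treats the population as very large, so the PQ/N term is dropped.

```python
from math import sqrt

def proportion_se(P, n, N=None):
    """Standard error of a sample proportion: sqrt(PQ/n - PQ/N).

    N=None treats the population as very large, so PQ/N is taken as zero.
    """
    Q = 1 - P
    finite_term = 0.0 if N is None else P * Q / N
    return sqrt(P * Q / n - finite_term)

print(proportion_se(0.4, 100))          # very large population
print(proportion_se(0.4, 100, N=500))   # finite population of 500 units
```

The finite-population value is the smaller of the two, because sampling a sizeable fraction of the population leaves less room for sampling error.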


3. Chi-Square Distribution

This distribution was first proposed by F. R. Helmert and later derived independently by Karl Pearson. If x1, x2, ..., xn are n independent standard normal variables, i.e. each of them is normally distributed with mean 0 and s.d. 1, then the statistic

$\chi^2 = \sum_{i=1}^{n} x_i^2$

is said to be distributed as χ2 with n degrees of freedom.

χ (chi) is a Greek letter pronounced 'kai'.
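The definition can be tried out by simulation. The Python sketch below builds chi-square values as sums of squared standard normal draws; the choice of n = 4 and of 20,000 repetitions is an assumption for illustration. The simulated mean and variance should come out close to n and 2n.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

def chi_square_draw(n):
    """One chi-square value: the sum of squares of n independent N(0, 1) draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(n))

n = 4
draws = [chi_square_draw(n) for _ in range(20_000)]

sim_mean = statistics.mean(draws)
sim_var = statistics.variance(draws)
print(f"simulated mean     = {sim_mean:.2f}  (theory: n  = {n})")
print(f"simulated variance = {sim_var:.2f}  (theory: 2n = {2 * n})")
```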

Application of χ2 distribution

I. The chi-square distribution is used in studying the association between two factors or attributes, each of them being at two or more levels. Some examples are:

Educational background (science, arts, commerce) of MBA students and their final grades in the MBA.

Creditworthiness of borrowers for personal loans and their age groups.

Coaching of students and their results in an examination.

Attitude towards the stock market and age group of the investor.

Returns on stocks and sectors such as banking, information, etc.

Training received and performance of staff such as salesmen.

Yield of a crop and levels of a fertilizer used.

II. The chi-square distribution is also useful for studying the closeness of observed and expected frequencies of events. The expected frequencies could be based either on some assumption or on the fitting of a distribution.

Properties of chi square distribution

1. χ2, being a sum of squares, is never negative. In fact, the range of χ2 is from 0 to ∞, since each xi ranges from -∞ to ∞.

2. The mean, mode and variance of χ2 are as follows:

Mean = n

Mode = n - 2 (for n ≥ 2)

Variance = 2n

3. The shape of the χ2 distribution depends on the value of n.

[Figure: χ2 density curves for three values of n.]

The red curve shows the distribution of chi-square values computed from all possible samples of size 3, where the degrees of freedom are n - 1 = 3 - 1 = 2. Similarly, the green curve shows the distribution for samples of size 5 (degrees of freedom equal to 4), and the blue curve, for samples of size 11 (degrees of freedom equal to 10).


4. (n-1)s2/σ2 is distributed as χ2 with (n-1) degrees of freedom, where

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

and the xi are distributed as N(m, σ2).

5. χ2 is also useful for studying the closeness of observed and expected frequencies of an event. The expected frequencies could be based on some assumption. For example, in a hospital, 700 babies were born during a week. Based on the assumption that a baby is equally likely to be born on any day of the week, the expected number of babies on each day of the week is 100. However, the observed (recorded) number of births could differ from 100 on each of the seven days. In such a case, if

oi is the observed frequency, or number of babies, for the ith (i = 1, 2, ..., 7) day of the week, and

ei is the expected frequency, or number of babies, for the ith (i = 1, 2, ..., 7) day of the week, then

$\chi^2 = \sum_{i=1}^{7} \frac{(o_i - e_i)^2}{e_i}$

is distributed as χ2 with (7-1) = 6 d.f.

Example:

Q. In a particular market there are three commercial television stations, each with its own evening news program from 6:00 to 6:30 P.M. According to a report in this morning's local newspaper, a random sample of 150 viewers last night revealed 53 watched the news on WNAE (channel 5), 64 watched on WRRN (channel 11), and 33 on WSPD (channel 13). At the .05 significance level, is there a difference in the proportion of viewers watching the three channels?

Ans. We use the chi-square goodness-of-fit test to test whether there is a difference in the proportion of viewers watching the three channels.

Null hypothesis: There is no difference in the proportion of viewers watching the three channels.

              Observed   Expected   (O-E)2   (O-E)2/E
Channel 5        53         50         9       0.18
Channel 11       64         50       196       3.92
Channel 13       33         50       289       5.78
Chi-square                                     9.88
p-value                                    0.007155

Test statistic

$\chi^2 = \sum \frac{(O - E)^2}{E}$ follows the chi-square distribution with n-1 = 2 d.f., where n = 3 is the number of channels.

$\chi^2 = 0.18 + 3.92 + 5.78 = 9.88$

P-value

$P(\chi^2 \ge 9.88) = 0.007155$

which is significant at the 0.05 level of significance.

Conclusion: We reject the null hypothesis at the 0.05 level of significance, as 0.007155 < 0.05. Thus there is a difference in the proportion of viewers watching the three channels.
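The arithmetic of this example can be reproduced with a short Python sketch. It assumes equal expected shares under the null hypothesis, as in the example. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the upper-tail probability of χ2 has the closed form exp(-x/2); for other degrees of freedom a table or a statistics library would be needed.

```python
from math import exp

# Observed viewers per channel, taken from the example.
observed = {"Channel 5": 53, "Channel 11": 64, "Channel 13": 33}

total = sum(observed.values())
expected = total / len(observed)   # 50 per channel if all shares are equal

# Goodness-of-fit statistic: sum of (O - E)^2 / E over the channels.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

# Closed form valid only for df == 2: P(chi-square >= x) = exp(-x / 2).
p_value = exp(-chi_square / 2)

print(f"chi-square = {chi_square:.2f}, d.f. = {df}, p-value = {p_value:.6f}")
print("reject H0 at 0.05:", p_value < 0.05)
```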

4. Student's t-Distribution

This distribution was introduced by W. S. Gosset in 1908; he published it under his pen name 'Student'.

When the population standard deviation (σp) is not known and the sample is small (i.e., n ≤ 30), we use the t distribution for the sampling distribution of the mean and work out the t variable as:

$t = \frac{\bar{X} - \mu}{\sigma_s / \sqrt{n}}$

where

$\sigma_s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$

i.e., the sample standard deviation.


There is a different t distribution for every possible sample size, i.e., for each number of degrees of freedom. The degrees of freedom for a sample of size n are n-1. As the sample size gets larger, the shape of the t distribution approaches that of the normal distribution.

In fact, for sample sizes above 30, the t distribution is so close to the normal distribution that we can use the normal to approximate the t distribution. When n is small, the t distribution differs noticeably from the normal, but as n → ∞ the t distribution becomes identical to the normal distribution. Tables of the t distribution give the critical value of t for different degrees of freedom at various levels of significance. The table value of t for the given degrees of freedom at a chosen level of significance is compared with the value of t calculated from the sample data; if the calculated value equals or exceeds the table value, we infer that the null hypothesis cannot be accepted.

Degrees of Freedom

There are actually many different t distributions. The particular form of the t distribution is determined by its degrees of freedom. The degree of freedom refers to the number of independent observations in a set of data.

When estimating a mean score or a proportion from a single sample, the number of independent observations is equal to the sample size minus one. Hence, the distribution of the t statistic from samples of size 8 would be described by a t distribution having 8 - 1 or 7 degrees of freedom. Similarly, a t distribution having 15 degrees of freedom would be used with a sample of size 16.

Properties of the t Distribution

The t distribution has the following properties:

The mean of the distribution is equal to 0.

The variance is equal to v / ( v - 2 ), where v is the degrees of freedom and v > 2.

The larger the value of n, the closer the distribution gets to the normal curve. For n > 30, it is almost identical with the normal distribution.

Condition for use of ‘t’ distribution

The following conditions need to be satisfied for using the ‘t’ distribution.

The variable on which the observations x1, x2, ..., xn are recorded follows a normal distribution in the population.

The sample size (n) is small, say < 30.

The s.d. of the variable in the population is unknown.


Example

Q.Acme Corporation manufactures light bulbs. The CEO claims that an average Acme light bulb lasts 300 days. A researcher randomly selects 15 bulbs for testing. The sampled bulbs last an average of 290 days, with a standard deviation of 50 days. If the CEO's claim were true, what is the probability that 15 randomly selected bulbs would have an average life of no more than 290 days?

Solution

The first thing we need to do is compute the t score, based on the following equation:

t = [ x - μ ] / [ s / sqrt( n ) ]

t = ( 290 - 300 ) / [ 50 / sqrt( 15 ) ] = -10 / 12.909945 = -0.7745966

where x is the sample mean, μ is the population mean, s is the standard deviation of the sample, and n is the sample size.

The degrees of freedom are equal to 15 - 1 = 14. The t score is equal to - 0.7745966.

The table gives the cumulative probability: 0.226. Hence, if the true bulb life were 300 days, there is a 22.6% chance that the average bulb life for 15 randomly selected bulbs would be less than or equal to 290 days.
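The calculation can be reproduced without a printed table. The Python sketch below computes the t score and then approximates the cumulative probability by integrating the t density numerically with Simpson's rule; the helper names `t_pdf` and `t_cdf` are hypothetical, and numerical integration is a stand-in chosen here for the table lookup.

```python
from math import gamma, pi, sqrt

def t_pdf(x, v):
    """Density of Student's t distribution with v degrees of freedom."""
    c = gamma((v + 1) / 2) / (sqrt(v * pi) * gamma(v / 2))
    return c * (1 + x * x / v) ** (-(v + 1) / 2)

def t_cdf(t, v, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, v) + t_pdf(a, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    area = total * h / 3               # area under the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area

x_bar, mu, s, n = 290, 300, 50, 15     # values from the example
t = (x_bar - mu) / (s / sqrt(n))
p = t_cdf(t, n - 1)

print(f"t = {t:.7f}, d.f. = {n - 1}, P(T <= t) = {p:.3f}")
```

The result agrees with the table value of about 0.226 quoted in the solution.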

5. F Distribution

This distribution was developed by Sir Ronald A. Fisher, but the name 'F' was given by George W. Snedecor. 'F' is defined as the ratio of two independent variables, each distributed as χ2 and divided by its respective d.f. If U is distributed as $\chi^2_m$ and V is distributed as $\chi^2_n$, and U and V are independent, then the variable

$F_{m,n} = \frac{U/m}{V/n} = \frac{\chi^2_m / m}{\chi^2_n / n}$

is distributed as 'F' with m and n d.f. These two numbers m and n, representing d.f., are the parameters of the 'F' distribution.

‘F’ distribution is also called Variance – ratio distribution.

The variable 'F' is defined as the ratio of sample estimates of the variance of two normal populations, with the larger estimate in the numerator and the smaller estimate in the denominator. It is expressed as

$F_{m,n} = s_1^2 / s_2^2$

where $s_1^2$ and $s_2^2$ are two sample variances. The sample variances could be either from two different populations or two different estimates from the same population. The subscripts m and n indicate the degrees of freedom of the statistic 'F'. Note that 'F' has a pair of d.f. values: the first is the d.f. of the numerator and the second is the d.f. of the denominator.

Properties of ‘F’ distribution

The important properties of the 'F' distribution are as follows:

'F', being the ratio of two positive quantities, is always positive. Since the range of χ2 is from 0 to ∞, the range of 'F' is also from 0 to ∞.

m and n represent degrees of freedom and are shape parameters that decide the shape of the distribution.

The mean of the 'F' distribution is equal to n/(n-2) for n > 2. Note that the mean depends only on the d.f. of the denominator.

Example

Q. A researcher claims that the variance of the distances traveled by the buses of Indiana on a particular day is greater than that of the buses of Iowa. The journeys of 31 buses in Indiana and 25 in Iowa were observed, and the variances were found to be 5600 and 3200 respectively. Find the critical value. Is the researcher's claim supported at a confidence level of 99%?

Step 1: $H_0: \sigma_1^2 \le \sigma_2^2$ and $H_1: \sigma_1^2 > \sigma_2^2$ [Null and alternative hypotheses.]

Step 2: $s_1^2 = 5600$, $s_2^2 = 3200$, $n_1 = 31$, $n_2 = 25$ [Choose $s_1$ and $s_2$ such that $s_1 > s_2$.]

Step 3: d.f.N = $n_1 - 1 = 30$ (numerator) and d.f.D = $n_2 - 1 = 24$ (denominator)

Step 4: For 99% confidence, α = 0.01 gives a critical value of F(30, 24) = 2.58

Step 5: The test value F = $s_1^2 / s_2^2$ = 5600/3200 = 1.75

Step 6: Since the test value is less than the critical value, we fail to reject the null hypothesis.

Step 7: So, there is not enough evidence to say that the variance of the distances traveled by the buses of Indiana is greater than that of the buses of Iowa.
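The steps above can be sketched in Python as follows. The critical value 2.58 is not computed here; it is the table value quoted in Step 4 and is simply copied in. The variable names are hypothetical.

```python
# Sample variances and sample sizes from the example.
s1_sq, n1 = 5600, 31   # first state (larger variance, goes in the numerator)
s2_sq, n2 = 3200, 25   # second state

# Test value: ratio of the two sample variances.
f_stat = s1_sq / s2_sq

# Degrees of freedom of numerator and denominator.
df_num, df_den = n1 - 1, n2 - 1

# Table value for F(30, 24) at alpha = 0.01, copied from Step 4 of the text.
critical = 2.58

reject_h0 = f_stat > critical
print(f"F({df_num}, {df_den}) = {f_stat:.2f}, critical value = {critical}")
print("reject H0:", reject_h0)
```

Because 1.75 is below 2.58, the sketch reports that the null hypothesis is not rejected, matching Step 6.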



Research methodology presentation

Topic- sampling Distribution

Submitted by

Kamini Gupta

Roll no. 117662

MBA Integrated 3rd sem