Introduction to Statistics

From the Data at Hand to the World at Large
Part I
Introduction to Statistics
Siana Halim

TOPICS
Population and Sample
Sampling distribution models
Confidence interval for proportions

References:
De Veaux, Velleman, Bock, Stats: Data and Models, Pearson Addison Wesley, International Edition, 2005.
John A. Rice, Mathematical Statistics and Data Analysis, Duxbury Press, 1995.

Sampling and Population
We'd like to know about an entire population of individuals, but examining all of them is usually impractical, if not impossible. So we settle for examining a smaller group of individuals, a sample, selected from the population.
We should select individuals for the sample at random. Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about.

The fraction of the population that you've sampled doesn't matter. It's the sample size itself that's important.
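The claim that sample size, not the sampled fraction, drives precision can be checked with a small simulation. The sketch below is only an illustration: the two population sizes, the 30% success rate, and the function name are assumptions chosen for the demonstration, not values from the slides.

```python
import random
import statistics

def sd_of_sample_proportions(pop_size, n, p_true=0.30, reps=2000, seed=1):
    """Empirical SD of the sample proportion over many simple random samples.

    Individuals are labeled 0..pop_size-1; labels below p_true * pop_size
    count as "successes" (a hypothetical population, not real data).
    """
    rng = random.Random(seed)
    n_success = int(p_true * pop_size)
    props = []
    for _ in range(reps):
        sample = rng.sample(range(pop_size), n)  # draw without replacement
        props.append(sum(1 for i in sample if i < n_success) / n)
    return statistics.stdev(props)

# Same sample size (n = 100), very different sampled fractions (1% vs 0.01%).
sd_small_pop = sd_of_sample_proportions(pop_size=10_000, n=100)
sd_large_pop = sd_of_sample_proportions(pop_size=1_000_000, n=100)
print(f"N = 10,000:    SD of sample proportions = {sd_small_pop:.4f}")
print(f"N = 1,000,000: SD of sample proportions = {sd_large_pop:.4f}")
```

Both standard deviations should come out nearly equal, because the spread of the sample proportion depends on n = 100 rather than on how large a share of the population was sampled.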

Sampling and Population
Does a census make sense?
It can be difficult to complete a census.
Populations rarely stand still.
Taking a census can be more complex than sampling.

Population and Parameters
Models use mathematics to represent reality. Parameters are the key numbers in those models.
A parameter used in a model for a population is called a population parameter.
Any summary found from the data is a statistic.

Name                      Statistic      Parameter
Mean                      $\bar{y}$      $\mu$ (mu)
Standard deviation        $s$            $\sigma$ (sigma)
Correlation               $r$            $\rho$ (rho)
Regression coefficient    $b$            $\beta$ (beta)
Proportion                $\hat{p}$      $p$

Simple Random Samples
We need to be sure that the statistics we compute from the sample reflect the corresponding parameters accurately; that is, the sample must be representative.
How would we select a representative sample?

A Simple Random Sample (SRS)
Every possible sample of the size we plan to draw has an equal chance of being selected.
Each combination of people has an equal chance of being selected as well.
The sampling frame is a list of individuals from which the sample is drawn.
Samples drawn at random generally differ from one another. Each draw of random numbers selects different people for our sample. These differences lead to different values for the variables we measure.
We call these sample-to-sample differences sampling variability.

Stratified Sampling
All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample.
Designs that are used to sample from large populations, especially populations residing across large areas, are often more complicated than simple random samples.

Sometimes the population is first sliced into homogeneous groups, called strata, before the sample is selected. Then simple random sampling is used within each stratum before the results are combined. This common sampling design is called stratified random sampling.

Cluster and Multistage Sampling
Splitting the population into similar parts, or clusters, can make sampling more practical.
Then we could simply select one or a few clusters at random and perform a census within each of them.
Sampling schemes that combine several methods are called multistage samples.
Sometimes we draw a sample by selecting individuals systematically. This is called systematic sampling.

Sampling Distribution Models
Why do sample proportions vary at all?
How can surveys conducted at essentially the same time by the same organization asking the same questions get different results?
The answer is at the heart of statistics.
It's because each survey is based on a different sample.
The proportions vary from sample to sample because the samples are composed of different people.

Modeling the Distribution of Sample Proportions
Most models are useful only when specific assumptions are true. In the case of the model for the distribution of sample proportions, there are two assumptions:
The sampled values must be independent of each other.
The sample size, n, must be large enough.
The corresponding conditions to check before using the Normal model for the distribution of sample proportions are:
10% condition: If sampling has been done without replacement, then the sample size, n, must be no larger than 10% of the population.
Success/failure condition: The sample size has to be big enough so that both np and nq are at least 10, where q = 1 - p.
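The two conditions can be written as a short check. This is a minimal sketch, assuming the "at least 10" form of the success/failure condition; the function name, the dictionary keys, and the second example's numbers are hypothetical.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Check the conditions for using a Normal model for a sample proportion.

    n: sample size; p: assumed population proportion.
    population_size: optional; needed only for the 10% condition.
    """
    q = 1 - p
    results = {
        # Expect at least 10 successes and at least 10 failures.
        "success_failure": n * p >= 10 and n * q >= 10,
    }
    if population_size is not None:
        # The sample should be no more than 10% of the population.
        results["ten_percent"] = n <= 0.10 * population_size
    return results

# 90 students with 13% left-handers: np = 11.7 and nq = 78.3.
lefty_check = check_proportion_conditions(n=90, p=0.13)
# A hypothetical case that fails both conditions.
small_check = check_proportion_conditions(n=50, p=0.05, population_size=400)
print(lefty_check)
print(small_check)
```

The plausibility of independence still has to be judged from knowledge of how the data were collected; no calculation can confirm it.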

The Sampling Distribution Model of a Proportion
Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of $\hat{p}$ is modeled by a Normal model with mean $\mu(\hat{p}) = p$ and standard deviation $SD(\hat{p}) = \sqrt{pq/n}$, where $q = 1 - p$.
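A quick simulation can show how well the Normal model's parameters, mean $p$ and standard deviation $\sqrt{pq/n}$, describe simulated sample proportions. The values p = 0.4 and n = 200, the number of repetitions, and the function name are assumptions made for this sketch.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, reps=5000, seed=7):
    """Draw `reps` independent samples of size n and return each sample proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

p, n = 0.4, 200                      # hypothetical population proportion and sample size
props = simulate_sample_proportions(p, n)

model_sd = math.sqrt(p * (1 - p) / n)     # SD(p-hat) from the model
empirical_mean = statistics.mean(props)
empirical_sd = statistics.stdev(props)

print(f"model mean = {p:.4f}, simulated mean = {empirical_mean:.4f}")
print(f"model SD   = {model_sd:.4f}, simulated SD   = {empirical_sd:.4f}")
```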

Sample proportion: $\hat{p} = y/n$
y is the number of successes
n is the sample size

The Central Limit Theorem (CLT)
As the sample size, n, increases, the mean of n independent values has a sampling distribution that tends toward a Normal model with mean equal to the population mean, $\mu$, and standard deviation $\sigma/\sqrt{n}$.

The CLT requires remarkably few assumptions, so there are few conditions to check:
Random sampling condition
Independence assumption

Sampling Distribution Model for a Mean
If the assumptions of independence and random sampling are met, and the sample size is large enough, the sampling distribution of the sample mean is modeled by a Normal model with a mean equal to the population mean, $\mu$, and a standard deviation equal to $SD(\bar{y}) = \sigma/\sqrt{n}$.
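The behavior described by the CLT can be seen even for a strongly skewed population. The sketch below assumes an exponential population with mean 1 and standard deviation 1; that choice, the sample size of 50, and the function name are illustrative only. It compares the simulated spread of sample means with $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps=4000, seed=11):
    """Means of `reps` independent samples of size n from an exponential population.

    The exponential distribution with rate 1 has mean 1 and SD 1 and is
    clearly right-skewed, so it is far from Normal.
    """
    rng = random.Random(seed)
    return [statistics.mean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

n = 50
means = simulate_sample_means(n)
mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)
model_sd = 1.0 / math.sqrt(n)            # sigma / sqrt(n) with sigma = 1

print(f"mean of sample means = {mean_of_means:.4f} (population mean 1)")
print(f"SD of sample means   = {sd_of_means:.4f} (model {model_sd:.4f})")
```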

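The parameters $\mu$ and $\sigma$ in the population are estimated by

Sample mean: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}$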
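As a small worked illustration, the usual formulas for the sample mean and the sample standard deviation (with the n - 1 divisor) can be computed directly and compared with Python's statistics module. The five weights are made-up values, and the helper names are hypothetical.

```python
import math
import statistics

def sample_mean(values):
    """y-bar: the sum of the observations divided by n."""
    return sum(values) / len(values)

def sample_sd(values):
    """s: square root of the sum of squared deviations divided by n - 1."""
    y_bar = sample_mean(values)
    return math.sqrt(sum((y - y_bar) ** 2 for y in values) / (len(values) - 1))

weights = [150.0, 180.0, 165.0, 200.0, 175.0]   # hypothetical data, in pounds
y_bar = sample_mean(weights)
s = sample_sd(weights)

print(f"sample mean = {y_bar:.1f}")
print(f"sample SD   = {s:.3f} (statistics.stdev gives {statistics.stdev(weights):.3f})")
```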

Working with Sampling Distribution Models
Example 1. About 13% of the population is left-handed. A 200-seat school auditorium has been built with 15 "lefty" seats, seats that have the built-in desk on the left rather than the right arm of the chair. In a class of 90 students, what's the probability that there will not be enough seats for the left-handed students?

Step-by-step
State what we want to know.
Check the conditions.
State the parameters and the sampling distribution model.
Make a picture. Sketch the model and shade the area we're interested in.
Find the z-score of the cutoff proportion.
Find the resulting probability from a table of Normal probabilities.
Discuss the probability in the context of the question.
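The steps above can be sketched numerically for Example 1. This is one possible working, assuming the Normal model with $SD(\hat{p}) = \sqrt{pq/n}$ and using Python's NormalDist in place of a printed table; the variable names are my own.

```python
import math
from statistics import NormalDist

p = 0.13          # proportion of left-handers in the population
n = 90            # class size
lefty_seats = 15  # seats available for left-handed students

# Check the success/failure condition: np = 11.7 and nq = 78.3.
q = 1 - p
conditions_ok = n * p >= 10 and n * q >= 10

# Parameters of the sampling distribution model for p-hat.
sd_p_hat = math.sqrt(p * q / n)

# There are not enough seats when the class proportion exceeds 15/90.
cutoff = lefty_seats / n
z = (cutoff - p) / sd_p_hat

# Upper-tail probability under the standard Normal model.
prob_not_enough = 1 - NormalDist().cdf(z)

print(f"conditions met: {conditions_ok}")
print(f"SD(p-hat) = {sd_p_hat:.4f}, cutoff = {cutoff:.4f}, z = {z:.2f}")
print(f"P(not enough lefty seats) = {prob_not_enough:.3f}")
```

Under these assumptions the probability comes out at about 0.15, so running short of lefty seats would not be unusual for a class of 90.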

Working with Sampling Distribution Models
Example 2. Suppose that the mean adult weight is 175 pounds with a standard deviation of 25 pounds. An elevator in our building has a weight limit of 10 persons or 2000 pounds. What's the probability that the 10 people who get on the elevator overload its weight limit?
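Example 2 can be worked the same way, because 10 people exceed 2000 pounds in total exactly when their mean weight exceeds 200 pounds. The sketch below assumes the 10 riders behave like a random sample and that the Normal model is adequate for a mean of only 10 weights; both are assumptions, not facts given in the example.

```python
import math
from statistics import NormalDist

mu = 175.0        # mean adult weight, pounds
sigma = 25.0      # standard deviation of adult weight, pounds
n = 10            # people on the elevator
limit = 2000.0    # total weight limit, pounds

# Overload happens when the mean weight of the 10 riders exceeds limit / n.
cutoff_mean = limit / n

# Sampling distribution model for the mean: Normal(mu, sigma / sqrt(n)).
sd_mean = sigma / math.sqrt(n)
z = (cutoff_mean - mu) / sd_mean

prob_overload = 1 - NormalDist().cdf(z)

print(f"SD(y-bar) = {sd_mean:.2f}, z = {z:.2f}")
print(f"P(overload) = {prob_overload:.4f}")
```

With a z-score a little above 3, the chance of overloading is under one in a thousand.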

Standard Error
When we estimate the standard deviation of a sampling distribution using statistics found from the data, the estimate is called a standard error.

Confidence Interval for Proportion
We have 95% confidence that the true proportion of the population is in our interval.

Confidence Interval (Example)

Sea fans, one spectacular kind of coral, in the Caribbean Sea have been under attack by the disease aspergillosis. In June of 2000, the sea fan disease team from Dr. Drew Harvell's lab randomly sampled some sea fans at the Las Redes Reef in Akumal, Mexico, at a depth of 40 feet. They found that 54 of the 104 sea fans they sampled were infected with the disease. What might this say about the prevalence of this disease among sea fans in general?

What can we say about the population proportion, p? Is the infected proportion of all sea fans 51.9%?

We do know, though, that the sampling distribution model of $\hat{p}$ is centered at p, and we know that the standard deviation of the sampling distribution is $SD(\hat{p}) = \sqrt{pq/n}$.

But we don't know p; instead we'll use $\hat{p}$ and find the standard error, $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n} = \sqrt{(0.519)(0.481)/104} = 0.049$.

Now we know the sampling model for $\hat{p}$ should look like this:

[Figure: Normal sampling distribution model for $\hat{p}$, centered at p]

Because it's Normal, it says that about 68% of all samples of 104 sea fans will have $\hat{p}$'s within 1 SE, 0.049, of p. And about 95% of all these samples will be within p ± 2 SEs. But where is our sample proportion in this picture?

We do know that for 95% of random samples, $\hat{p}$ will be no more than 2 SEs away from p. So let's look at this from $\hat{p}$'s point of view. If I'm $\hat{p}$, there's a 95% chance that p is no more than 2 SEs away from me. If I reach out 2 SEs, or 2 × 0.049, away from me on both sides, I'm 95% sure that p will be within my grasp. Now, I've got him! Probably.

So what can we really say about p?
"51.9% of all sea fans on the Las Redes Reef are infected." NO WAY!
"It is probably true that 51.9% of all sea fans on the Las Redes Reef are infected." NO
"We don't know exactly what proportion of sea fans on the Las Redes Reef are infected, but we know that it's within the interval 51.9% ± 2 × 4.9%. That is, it's between 42.1% and 61.7%." GETTING CLOSER!
"We don't know exactly what proportion of sea fans on the Las Redes Reef are infected, but the interval from 42.1% to 61.7% probably contains the true proportion." TRUE, but a bit wishy-washy.
"We are 95% confident that between 42.1% and 61.7% of Las Redes Reef sea fans are infected." YES!
Statements like these are called confidence intervals. They're the best we can do. The interval is called a one-proportion z-interval.

"Far better an approximate answer to the right question, than an exact answer to the wrong question." - John W. Tukey

Margin of Error
A confidence interval (CI) has the form $\hat{p} \pm 2\,SE(\hat{p})$.

The extent of the interval on either side of $\hat{p}$ is called the margin of error (ME). In general, CIs look like this:
estimate ± ME
The more confident we want to be, the larger the margin of error must be.
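The sea fan interval can be reproduced in a few lines. This sketch applies the "reach out 2 SEs" rule from the example to the counts 54 and 104; the variable names are my own.

```python
import math

infected = 54     # infected sea fans in the sample
n = 104           # sea fans sampled

p_hat = infected / n
q_hat = 1 - p_hat

# Standard error: the SD formula with p-hat substituted for the unknown p.
se = math.sqrt(p_hat * q_hat / n)

# Margin of error for roughly 95% confidence: 2 standard errors.
me = 2 * se
low, high = p_hat - me, p_hat + me

print(f"p-hat = {p_hat:.3f}, SE = {se:.3f}, ME = {me:.3f}")
print(f"approximate 95% CI: {low:.1%} to {high:.1%}")
```

The result matches the interval quoted in the example, 42.1% to 61.7%.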

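Critical Value

[Figure: standard Normal curves marking the central 95% between z = -1.96 and z = 1.96, and the central 90% between z = -1.645 and z = 1.645]

The values z* = 1.96 and z* = 1.645 are called critical values.
The CIs for the population proportion and the population mean can be formulated as follows:

$\hat{p} \pm z^* \times SE(\hat{p})$

$\bar{y} \pm z^* \times SE(\bar{y})$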
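The critical values 1.96 and 1.645 come from the standard Normal model: z* cuts off the central area equal to the confidence level. A minimal sketch using Python's NormalDist follows; the helper name critical_value is hypothetical.

```python
from statistics import NormalDist

def critical_value(confidence):
    """z* such that the central `confidence` area of the standard Normal lies in (-z*, z*)."""
    # The central area leaves (1 - confidence) / 2 in each tail,
    # so z* is the quantile at 0.5 + confidence / 2.
    return NormalDist().inv_cdf(0.5 + confidence / 2)

z95 = critical_value(0.95)
z90 = critical_value(0.90)
print(f"95% confidence: z* = {z95:.3f}")
print(f"90% confidence: z* = {z90:.3f}")
```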

Assumptions and Conditions
Independence Assumption: check three conditions.
Plausible independence condition. This condition depends on your knowledge of the situation.
Randomization condition. Were the data sampled at random or generated from a properly randomized experiment?
10% condition.
Sample Size Assumption: check the success/failure condition.
We must expect at least 10 successes and at least 10 failures.

One-proportion z-interval
When the conditions are met, we are ready to find the confidence interval for the population proportion, p. The confidence interval is

$\hat{p} \pm z^* \times SE(\hat{p})$

where the standard error of the proportion is estimated by

$SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$

Example
In May 2002, the Gallup Poll asked 537 randomly sampled adults the question "Generally speaking, do you believe the death penalty is applied fairly or unfairly in this country today?" Of these, 53% answered "Fairly" and 7% said they didn't know. What can we conclude from this survey?
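One way to answer the question is a 95% one-proportion z-interval for the proportion who would answer "Fairly". The sketch below assumes the conditions are met (random sample, far fewer than 10% of all adults, well over 10 successes and failures); the function name is hypothetical.

```python
import math
from statistics import NormalDist

def one_proportion_z_interval(p_hat, n, confidence=0.95):
    """Return (low, high) for the interval p-hat +/- z* x SE(p-hat)."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    me = z_star * se
    return p_hat - me, p_hat + me

# Gallup example: 53% of 537 adults answered "Fairly".
low, high = one_proportion_z_interval(p_hat=0.53, n=537)
print(f"95% CI for the proportion answering 'Fairly': {low:.1%} to {high:.1%}")
```

Read in context: we can be 95% confident that between about 48.8% and 57.2% of all adults would have said the death penalty is applied fairly. Because the interval includes values below 50%, the survey does not establish that a majority holds this view.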

Student t distribution

[Figure: density curves of the standard Normal (t with df = ∞), t (df = 13), and t (df = 5), all centered at 0]

The t-distribution has a shape similar to the Normal distribution, but it has longer tails.
Note: t → Z as n increases.

t-Distribution table (upper tail areas)

df      .25      .10      .05
1       1.000    3.078    6.314
2       0.816    1.886    2.920
3       0.765    1.638    2.353

The body of the table gives the value of t, not the probability.
Let: n = 3, df = n - 1 = 2, α = .10, α/2 = .05. Then t = 2.920.

Using the t distribution, the CI for the mean can be formulated as

$\bar{y} \pm t^*_{n-1} \times SE(\bar{y})$, where $SE(\bar{y}) = s/\sqrt{n}$
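The table lookup for n = 3 can be checked without a statistics library, because the t distribution with 2 degrees of freedom has a closed-form quantile, $t = (2P - 1)/\sqrt{2P(1 - P)}$ for cumulative probability P. The sketch below relies on that special case only; the three data values and the function names are made up for illustration, and other degrees of freedom would need a table or a statistics package.

```python
import math
import statistics

def t_quantile_df2(cum_prob):
    """Quantile of the t distribution with df = 2 (closed form valid for df = 2 only)."""
    return (2 * cum_prob - 1) / math.sqrt(2 * cum_prob * (1 - cum_prob))

def t_interval_n3(sample, confidence=0.90):
    """CI for the mean of exactly three observations: y-bar +/- t* x s / sqrt(n)."""
    if len(sample) != 3:
        raise ValueError("this sketch handles n = 3 (df = 2) only")
    alpha = 1 - confidence
    t_star = t_quantile_df2(1 - alpha / 2)      # upper tail area alpha / 2
    y_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return y_bar - t_star * se, y_bar + t_star * se

t_star_90 = t_quantile_df2(0.95)                # upper tail area .05, df = 2
low, high = t_interval_n3([10.0, 12.0, 14.0])   # hypothetical data
print(f"t* (df = 2, upper tail .05) = {t_star_90:.3f}")
print(f"90% CI for the mean: {low:.2f} to {high:.2f}")
```

The computed critical value agrees with the table entry 2.920 for df = 2 and upper tail area .05.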
