3 - 2 - Unit 3, Part 1: (1) Sampling Variability and CLT (21-00)

Upload: erotemethinks8580

Post on 03-Jun-2018


TRANSCRIPT

  • 8/12/2019 3 - 2 - Unit 3, Part 1- (1) Sampling Variability and CLT (21-00)


In this video, we will define sampling distributions. We're going to introduce the central limit theorem and review the conditions required for the theorem to apply. And we're also going to do some simulation demos to illustrate the central limit theorem and start talking about why it works, without going into a theoretical proof, as well as talk about how it works and why it might be of use to us.

    Say we have a population of interest, and we take a random sample from it. Based on that sample, we calculate a sample statistic, for example, the mean of that sample. Then suppose we take another random sample and also calculate and record its mean. Then we do this again, and again, many more times.
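The repeated-sampling procedure just described can be sketched in a few lines of code. This is a minimal Python illustration, not from the video; the synthetic population, its parameters, and the seed are assumptions made for reproducibility:

```python
import random
import statistics

# Hypothetical population: 100,000 values standing in for a population
# of interest (the numbers here are illustrative, not from the video).
random.seed(1)
population = [random.gauss(65, 3.5) for _ in range(100_000)]

def sample_mean(pop, n):
    """Draw one random sample of size n and return its mean."""
    return statistics.mean(random.sample(pop, n))

# Take many random samples and record each sample's mean.
sample_means = [sample_mean(population, 1_000) for _ in range(50)]
print(len(sample_means))  # 50 recorded sample statistics
```

Each entry of `sample_means` is one sample statistic; together they form a simulated version of the sampling distribution defined next.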

    Each one of the samples will have its own distribution, which we call a sample distribution. Each observation in these distributions is a randomly sampled unit from the population, say, a person, or a cat, or a dog, depending on what population you are studying. The values we recorded from each sample, the sample statistics, also now make a new distribution, where each observation is not a unit from the population but a sample statistic, in this case, a sample mean. The distribution of these sample statistics is called the sampling distribution. So the two terms, sample and sampling distributions, sound similar, but they're different concepts.

    Let's give a more concrete example. Suppose we're interested in the average height of US women. Our population of interest is US women. We'll call capital N the population size,

    and our parameter of interest is the average height of all women in the US, which we denote as mu. Let's assume that we have height data from every single woman in the US. Using these data, we could find the population mean; 65 inches is probably a reasonable estimate. Using the same population data, we can also calculate the population standard deviation, which we usually call sigma. We wouldn't expect this number, the sigma, to be very small, since the heights of all women in the US are probably very variable: it's possible to find a woman as short as 4 feet tall, or as tall as 7 feet.

    Then, let's assume that we take random samples of 1,000 women from each state. We'll start with the first state on the alphabetical list, Alabama. We sample 1,000 women from Alabama. We represent each woman in our sample with an x, and we use subscripts to keep track of the state, as well as the observation number, ranging from 1 to 1,000. Then, we collect data from 1,000 women

    from each of a bunch more states, including North Carolina, where I happen to be currently located, and then a bunch more, till finally we get to the last state on the alphabetical list, Wyoming. For each state, we calculate the state's mean, which we denote as x bar. So now we have a data set consisting of a bunch of means, 50 to be exact, since there are 50 states. We call this distribution the sampling distribution.

    The mean of the sample means will probably be around the true population mean, roughly 65 inches as well. The standard deviation of the sample means will probably be much lower than the population standard deviation, since we would expect the average heights for each state to be pretty close to one another. For example, we wouldn't expect to find a state where the average height of a random sample of 1,000 women is as low as 4 feet or as high as 7 feet. We call the standard deviation of the sample means the standard error. In fact, as the sample size, n, increases, the standard error will decrease. The fewer women we sample from each state, the more variable we would expect the sample means to be.

    Next, we're going to illustrate what we were just talking about, in terms of sampling distributions, their shapes, centers, and spreads, using an applet that simulates a bunch of sampling distributions for us, given certain parameters of the population distribution and its shape. If you would like to play along with us, you can follow the URL on the screen. Let's start with the default case of a normal distribution for the population, with mean 0 and standard deviation 20. Let's take samples of, say, size 45 from this population. And what we can see here is that each one

    of these dot plots shows us one sample of 45 observations from the normal population. We can see that the center of each one of these samples is close to 0, though not exactly 0. And we can also see that the sample mean varies from one sample to another. Since these are random samples from the population, each time we reach out to the population and grab 45 observations, we may not be getting the same sample. In fact, we will not be getting the same sample, and therefore the x bars for each sample are slightly different. The standard deviations of each one of these samples should be roughly equal to the population standard deviation, because, after all, each one of these samples is simply a subset of our population.

    We have illustrated the first eight samples here, but we're actually taking 200 samples from the population. We can make this a very large number, say, 1,000 samples from the population. And what we have at the very bottom is basically our sampling distribution. Each one of the sample means, once calculated, gets dropped to the lower plot, and what we're seeing here is a distribution of sample means.


    Since we saw that the sample means had some variability among them, the sampling distribution basically illustrates for us what this variability looks like. The sampling distribution, as we expected, is looking just like the population distribution, so nearly normal, and the center of the sampling distribution, that is, the mean of the means, is close to the true population mean of 0. However, one big difference between our population distribution up top and our sampling distribution at the bottom is the spread of these distributions. The sampling distribution at the bottom is much skinnier than the population distribution up top. And if you think about it, while the standard deviation of the population distribution is 20, the standard error, so the standard deviation of the sample means, is only 2.93. The reason for this is that while individual observations can be very variable, it is unlikely that sample means are going to be very variable.

    So, if we want to decrease the variability of the sample means, meaning we want samples that have more consistent means, we would want to increase our sample size. Let's say that we increase our sample size all the way to 500. What we have here is, again, our same population distribution. Here, we're seeing the first eight of the 1,000 samples being taken from the population. The distributions look much more dense here because we simply have more observations: each one of these samples now represents a sample of 500 observations from the population. And we can also see that the means are, again, variable, but let's check to see if they're as variable as before. The curve is indeed skinnier, so the higher the sample size of each sample that


    you're taking from the population, the ve,less variable the means of those samples.And indeed, we can see it graphically,looking at the curve.And we can see it numerically, looking atthe value of the standard error.Now it's finally time to introduce thecentral limit theorem.In fact, the central limit theorem saysthat the samplingdistribution of the mean, distribution ofsample means from many samples,is nearly normal, centered at thepopulation mean, with standarderror equal to the population standarddeviation divided by the squareroot of the sample size.Note that this is called the central limittheorembecause it's central to much of thestatistical inference theory.So the central limit theorem tells usabout the shape, which itsays that it's going to be nearly normal,

    the center, which itsays that the sampling distribution'sgoing to be centered at the populationmean, and the spread of the samplingdistribution, which we measure usingthe standard error.If sigma is unknown, which is often thecase, remember,sigma is the population standarddeviation, and oftentimes, we don't haveaccess to the entire population tocalculate this number, weuse S, the sample standard deviation, to

    estimate the standard error.So that would be the standard deviation ofone sample that we happen to have at hand.In the earlier demo, the simulation,we talked about taking many samples, butif you're running astudy, as you can imagine, you would onlytake one sample.So that's the standard deviation of thatsample that wewould use as our best guess for thepopulation standard deviation.So it wasn't a coincidence that the

    sampling distribution we saw earlier wassymmetricand centered at the true population mean,and that as n increased, the samplesized increased, the standard errordecreased.We won't go through a detailed proof ofwhy the standard error is equal tosigma over square root of n, butunderstanding


    the inverse relationship between them is very important. As the sample size increases, we would expect samples to yield more consistent sample means; hence the variability among the sample means would be lower, which results in a lower standard error.

    Certain conditions must be met for the central limit theorem to apply. The first one is independence: sampled observations must be independent. This is very difficult to verify, but it is more likely if we have used random sampling or assignment, depending on whether we have an observational study, where we're sampling from the population randomly, or an experiment, where we're randomly assigning experimental units to various treatments. And, if sampling without replacement, the sample size n should be less than 10% of the population. So, we've previously mentioned that we love large samples, and now we're saying that, well, we don't exactly want them to be very large. We're going to talk about why this is the case in a moment.

    The other condition is related to the sample size, or skew. Either the population distribution is normal, or, if the population distribution is skewed or we have no idea what it looks like, the sample size is large. According to the central limit theorem, if the population distribution is normal, the sampling distribution will also be nearly normal, regardless of the sample size. We illustrated this earlier when we were working with the applet, where we looked at a sample size of 45, as well as a sample size of 500, and in both instances the sampling distribution was nearly normal. However, if the population distribution is not normal, the more skewed the population distribution, the larger the sample size we need for the central limit theorem to apply. For moderately skewed distributions, n greater than 30 is a widely used rule of thumb that we're going to make use of often in this course as well.
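The n greater than 30 rule of thumb can be explored with a quick simulation. As an assumed stand-in for a right-skewed population I use an exponential distribution (theoretical skewness 2); the skewness of the simulated sampling distribution should then shrink roughly like 2 / sqrt(n) as the sample size grows:

```python
import random
import statistics

random.seed(7)

def skewness(xs):
    """Standardized third moment of a list of values."""
    m, s = statistics.mean(xs), statistics.stdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def sampling_dist_skew(n, reps=2_000):
    # Right-skewed parent population: exponential draws (illustrative choice).
    means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
             for _ in range(reps)]
    return skewness(means)

small, large = sampling_dist_skew(10), sampling_dist_skew(100)
print(round(small, 2), round(large, 2))
# the sampling distribution's skew shrinks noticeably once n passes the rule of thumb
```

With n of 10 the skew of the parent still shows through; by n of 100 the sampling distribution is close to symmetric, which is the rule of thumb at work.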


    The distribution of the population is also something very difficult to verify, because we often do not know what the population looks like; that's why we're doing this investigation in the first place. But we can check it using the sample data and assume that the sample mirrors the population. So if you make a plot of your sample distribution and it looks nearly normal, then you might be fairly certain that the parent population distribution it's coming from is nearly normal as well. We'll discuss these conditions in more detail in the next couple of slides. First, let's focus on the 10% condition: if sampling without replacement, n needs to be less than 10% of the population, as we stated earlier.
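One way to see the 10% condition numerically: in a hypothetical town of N people, if you are already in a sample drawn without replacement, the remaining n minus 1 slots are filled from the other N minus 1 residents, so the chance that one specific relative of yours is also drawn is (n - 1) / (N - 1). The function name and the framing are mine; the town size and sample sizes mirror the example in the transcript:

```python
def p_relative_included(N, n):
    """Chance a specific other resident joins a without-replacement sample
    of size n from N people, given that you are already in it."""
    return (n - 1) / (N - 1)

N = 1_000  # town size from the example
print(round(p_relative_included(N, 10), 4))   # 0.009  (n = 10: quite unlikely)
print(round(p_relative_included(N, 500), 4))  # 0.4995 (n = 500: a coin flip)
```

Keeping n under 10% of N keeps this kind of overlap, and the dependence it creates, small.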

    Why is this the case? Let's think about it for a moment. Say that you live in a very small town; say that the population of the town is only 1,000 people. And your family lives there as well, including your extended family. Say that I'm a researcher doing research on some genetic application, and I want to randomly sample some individuals from your town. Say I take a random sample of size just 10. If we're randomly sampling 10 people out of 1,000, and let's say you are included in our sample, it's going to be quite unlikely that your parents are also included in that sample, because, remember, we're only grabbing 10 out of a population of 1,000. But say, on the other hand, I actually sample 500 people from the 1,000 that live in your town. If in this town you live with your parents and all of your extended family, and I've already grabbed you to be in my sample, and I have 499 other people to grab, chances are I might get somebody from your family in my sample as well. You


    and a family member of yours are not genetically independent, because observations in the population itself are often not independent of each other. Therefore, if we grab a very big portion of the population to be in our sample, it's going to be very difficult to make sure that the sampled individuals are independent of each other. That's why, while we like large samples, we also want to keep the size of our samples somewhat proportional to our population. A good rule of thumb, if we're sampling without replacement, is that we don't grab more than 10% of the population to be in our sample.

    Sampling with replacement is not something we often do in survey settings, because if I've already sampled you once, and given you a survey, and gotten your responses, I don't want to be able to sample you again; I don't need your responses again. But if I were sampling with replacement, then the probability of sampling you versus somebody from your family would stay consistent throughout all of the trials. That's why we wouldn't need to worry about the 10% condition there. But again, in realistic survey sampling situations, we sample without replacement, and we like large samples, but we also do not want our samples to be any more than 10% of our population.

    And what about the sample size/skew condition? Say we have a skewed population distribution. Here, we have a population distribution that's extremely right skewed. When the sample size is small, here we're looking at a sampling distribution created based on samples of n equals just 10, the sample means will be quite variable, and the shape of their distribution will mimic the population distribution. Increasing the sample size a bit, now we've gone from n equals 10 to n equals 100, decreases the standard error, and the distribution starts to condense


    around the mean and starts looking more unimodal and symmetric. With quite large samples, here we're looking at a sampling distribution where each of the individual samples, based on which the sample means were calculated, had a sample size of 200. With quite large samples like this, we can actually overcome the effect of the parent distribution; the central limit theorem kicks in, and the sampling distribution starts to closely resemble a normal distribution. Why are we somewhat obsessed with having nearly normal sampling distributions? Because we've learned earlier that once you have a normal distribution, calculating probabilities, which will later serve as our p-values in our hypothesis tests, is relatively simple.
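For instance, a tail probability under a normal curve takes only a couple of lines. The numbers below, a sampling distribution with mean 65 and a standard error of 0.5, are made up for illustration:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# If the sampling distribution of the mean is N(mu = 65, SE = 0.5),
# the probability of observing a sample mean of 66 or more is:
p_value = 1 - normal_cdf(66, mu=65, sigma=0.5)
print(round(p_value, 4))  # 0.0228, the upper tail beyond two standard errors
```

This is exactly the kind of calculation that nearly normal sampling distributions make routine.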

    So, having a nearly normal sampling distribution that relies on the central limit theorem is actually going to open up a bunch of doors for us for doing statistical inference, using confidence intervals and hypothesis tests based on normal distribution theory.

    Let's do another demo real quick. We've looked earlier at what a sampling distribution looks like when we have a nearly normal population distribution. Let's take a look to see what happens if the population distribution is not nearly normal. Suppose I first pick a uniform distribution. Here, we can see that our population distribution is uniform. Let's say that it's going to be uniform between 4 and, since our upper bound needs to be greater, 12. So we can see a uniform distribution between 4 and 12, with absolutely no peak, and so on. Say that we're actually taking samples of size just 15 from this distribution. Each one of our samples contains 15 observations from the parent population, and the centers of these samples are going to be somewhere close to the


    population mean. We take a bunch of these samples, 1,000 of them, and let's take a look at what the sampling distribution looks like. It actually looks fairly unimodal and symmetric. The center of the distribution is very close to our population distribution's mean, and the variability of this distribution is actually much lower than our population distribution's: we can see that the standard error is 0.59, while the original population standard deviation was 2.31. What happens if we have skewed data? Here, we have a population distribution that's right skewed. We're taking samples of size 15. And let's actually make this an extremely right skewed distribution. So this is what this looks like.
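The numbers from the uniform demo line up with the central limit theorem formula: a Uniform(a, b) population has standard deviation (b - a) / sqrt(12), and dividing by sqrt(n) gives almost exactly the standard error the applet reported. A quick check (the variable names are mine):

```python
import math

a, b, n = 4, 12, 15                  # uniform bounds and sample size from the demo
sigma = (b - a) / math.sqrt(12)      # sd of a Uniform(a, b) distribution
se = sigma / math.sqrt(n)            # CLT standard error, sigma / sqrt(n)
print(round(sigma, 2), round(se, 2)) # prints: 2.31 0.6
```

The theoretical 2.31 matches the demo's population standard deviation, and the theoretical standard error of about 0.60 sits right next to the simulated 0.59.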

    If we're taking samples of size 15, and here we're taking a look at each one of our individual samples, the sampling distribution is looking awfully skewed. However, if I increase my sample size to be much larger, say 500, then my sampling distribution is starting to look much more unimodal and symmetric, and starting to resemble a nearly normal distribution.

    What about a left skewed distribution? Once again, let's make the skew of this distribution pretty high. And we can see that our sampling distribution, when we have a large number of observations in each sample, we have still kept it at 500 observations, looks pretty nearly normal. However, if I were to decrease my sample size to something pretty small, say 24, then my sampling distribution is looking more and more left skewed. And in fact, if I take even smaller samples, let's go all the way down to 12, for example, now the distribution is looking even more skewed. If, though, I decrease the skew, so that my population distribution to begin with is not looking all that skewed anyway, then I really don't need a whole lot of observations in my sample. Here, I have only 12 observations in each


    sample, and the sampling distribution is already looking pretty unimodal and symmetric. So the moral of this story is: the more the skew, the higher the sample size you need for the central limit theorem to kick in. Please feel free to go play with this applet, interact with it, and find out for yourself what the sampling distribution looks like in various scenarios. And also play around with the different parameters of the distributions: pick how skewed they are; if it's a uniform distribution, what the minimum and the maximum are; or, if it's a normal distribution, what the mean and the standard deviation are.
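The kind of exploration the applet supports can also be scripted. This closing sketch, with population choice and sample sizes picked by me for illustration, varies the sample size for a Uniform(4, 12) population and watches the standard error shrink while the center stays put:

```python
import random
import statistics

random.seed(0)

def sampling_summary(draw, n, reps=1_000):
    """Mean and standard deviation (standard error) of reps sample means of size n."""
    means = [statistics.mean(draw() for _ in range(n)) for _ in range(reps)]
    return statistics.mean(means), statistics.stdev(means)

# Vary the sample size, as you would with the applet's slider.
results = {n: sampling_summary(lambda: random.uniform(4, 12), n)
           for n in (15, 60, 240)}
for n, (center, se) in results.items():
    print(n, round(center, 2), round(se, 3))
# the center stays near the population mean of 8 while the SE shrinks with n
```

Swapping in a skewed `draw` function, for example `random.expovariate`, reproduces the skew experiments from the video as well.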