week 7 olympics: does medal count depend on population...
TRANSCRIPT
STAT100 Notes March2,2010 Week7 Olympics:Doesmedalcountdependonpopulationsize? Sampling:PoliticalPolls.PopulationsandSamples RandomizedResponseTechnique Hill:EvaluatingSchoolChoicePrograms Heiligetal:LeveragingChanceinHIVResearch Assignment#7:SampleMidtermII,DueMarch9,4pm.(Isweb‐postednow.)Olympics: Q:Isitpossibletocompareacountry’ssuccessinwinningOlympicmedalsinawaythatallowsforthecountry’spopulationsize?(Rprogram:medals.demo()) country medals.s pop gold medals.sw 1 usa 37 309 9 147 2 ger 30 82 10 71 3 nor 23 5 9 32 4 can 26 34 14 44 5 rus 15 142 3 87 6 kor 14 75 6 51 7 aus 16 8 4 19 8 fra 11 65 2 52 9 swi 9 8 6 16 10 chi 11 1336 5 111 11 swe 11 9 5 16 12 net 8 17 4 24 13 cze 6 11 2 12 14 pol 6 38 1 16 15 ita 5 60 1 32 16 ast 3 22 2 49 17 slk 3 5 1 9 18 jap 5 127 0 31 19 lat 2 2 0 5 20 bel 3 10 1 20 21 cro 3 4 0 8 22 sln 3 2 0 8 23 gbr 1 62 1 48 24 est 1 1 0 3 25 fin 5 5 0 9 26 kaz 1 16 0 14
ButChinaissopopulouscomparedtotheothercountriesinthewinterOlympicsthattherelationshipofmedalstopopulationissquashed.WeputasideChinaforthisfurtherinvestigationoftherelationship.Theresultisshownbelow.
0 200 400 600 800 1000 1200
05
10
15
20
25
30
35
Winter 2010 Olympics
Population
Num
ber
of M
edals
usa
ger
nor
can
rus
kor
aus
fra
swi
chiswe
net
czepolita
astslk
jap
latbelcrosln
gbrest
fin
kaz
Therelationshipisstillabitobscuredbytoomanysmallcountriesinthelowerleftcornerofthegraph.Let’sstretchitoutbytakingthesquarerootofpopulation.…
0 50 100 150 200 250 300
020
40
60
80
100
120
140
Winter 2010 & Summer 2008 Olympics
Population
Num
ber
of M
edals
usa
ger
nor
can
rus
kor
aus
fra
swiswe
net
cze
pol
ita
ast
slk
jap
lat
bel
crosln
gbr
est
fin
kaz
Nowwehaveaclearerrelationship.Butisitlegaltouse√Population?Asimpler(butequivalent)waytoshowtherelationshipistoallowtherelationshiptobecurved.Theidenticalfitcanthenbeshownwithoutthesquareroottrick.
5 10 15
020
40
60
80
100
120
140
Winter 2010 & Summer 2008 Olympics
!Population
Num
ber
of M
edals
usa
ger
nor
can
rus
kor
aus
fra
swiswe
net
cze
pol
ita
ast
slk
jap
lat
bel
crosln
gbr
est
fin
kaz
Whatuseisthisgraph?Clearlythereareimportantdifferencesthatweknowabout,suchaspercapitaGNP,thatwouldallowsomecountriestofundamoremedal‐winningteam.Butatleastwecanseetheeffectofpopulationbaseandadjustforthatincomparingcountries.Theresidualsintheabovegraphareameasureoftheextentthatacountry’smedalcountisnotexplainedbypopulation.AndforcountrieslikeNorway,Canada,andSlovenia,andothers,thisisausefulthing!Sampling:PoliticalPollsJustpriortotheFall2008election,thefollowinggraphwaspublishedonthecoverofthemagazineLiaison:
0 50 100 150 200 250 300
020
40
60
80
100
120
140
Winter 2010 & Summer 2008 Olympics
Population
Num
ber
of M
edals
usa
ger
nor
can
rus
kor
aus
fra
swiswe
net
cze
pol
ita
ast
slk
jap
lat
bel
crosln
gbr
est
fin
kaz
ThegraphwasproducedbyAndrewHeardofthePoliticalScienceDeptatSFU.Theheavylinesaresmoothesofthefinerlines,usingthemovingaveragetechnique.ThefinerlinesjoinuppointsfromsurveysdonebyvariousorganizationsinCanada.WeknowtheresultwasaminoritygovernmentoftheConservativeparty,whichwouldbenosurprisetosomeonelookingatthischartbeforetheelection.(Canada’sfirst‐past‐the‐postelectionresultsystemtendstoamplifysmalldifferencesinthepopularvote,butnotenoughinthisinstancetogivetheConservativesamajority.)Hereisanupdatedversionoftheabovegraph,alsosuppliedbySFUsAndrewHeard:
Hereisthequestionmostimportantforourpurposes:Q:Howcananopinionpollbasedonafewthousandresponsesbeareliableindicationoftheaggregateopinionsofmillions?Nationalpoliticalpollstendtosampleabout2,000peoplewhilethepopulationofCanadaisabout35,000,000.Ifthesesamplesurveysproducereliableinformation,theyarequiteusefulandevenamazing,right?Hereisthekeyformulathatexplainsthisamazingresult:
€
SDsample.mean =
€
SDpopulation
n.
€
N − nN −1
HereNisthepopulationsize,andnisthesamplesize.Exceptforthelastsquareroot,thisformulashouldlookfamiliar.Rememberthataverages(=means)varylessthatthethingsthatareaveraged,andthereductioninvariabilityisafactor
€
n .Weneedtoexplainthesuddenadditionofthatlastterm
€
N − nN −1
.
Whenwetalkaboutsamplingfromanormalprobabilitydistribution,orfromanyprobabilitydistribution,weusuallyimplicitlyassumethattheremovalofanitemfromthepopulation(drawnintothesample)doesnotchangethemakeupofthepopulation–justasifthepopulationwereinfinite,or,ifweweresamplingwith
replacement.Butinpollsfromrealpopulations(asopposedtoabstractonesdefinedbyaprobabilitydistribution)wemightneedtoconsiderthatwearereallysamplingwithoutreplacement.Q:Ifwechooseasamplewithoutreplacementfromareal(andfinite)population,andwedecidetosamplethewholepopulation,howvariablewouldthesamplemeanbeinsuchaprocess?Commonsenseanswersthisquestion,butitiscomfortingthattheaboveformulaalsoprovidesthesameanswer!Now,inpoliticalpollsliketheonesmentionedintheintroabove,thesamplesizenismuchsmallerthanthepopulationsizeN.Inthiscase,whathappenstothefactor
€
N − nN −1
?Clearly,itwillbeverycloseto1,andmultiplyingitbythemorefamiliar
partoftheformulawillnotchangeanything.(rewriting
€
N − nN −1
as
€
1− n −1N −1
makesthisevenmoreclear).Aruleofthumbis,whenn/Nislessthan1/20,ignorethefactor.Thefactoriscommonlycalled“thefinitepopulationcorrectionfactor”butperhapsamoredescriptivenameforitwouldbe“thecorrectionfactorforsamplingwithoutreplacement”.Thetwonamesareequivalent–asabitofreflectionreveals.Oneimportantconclusionfromthediscussionisthat,inpoliticalpolling,onecan
ignorethefactor
€
N − nN −1
,andsooneisleftwiththeusualsquarerootrelationship
betweenthepopulationSDandtheSDofthesamplemean.Q:Politicalpollsusuallyendupwithresultsexpressedintermofproportionsofpercentages,whereasmeansareusuallycomputedforquantitativedata(likeincome,importsorGNP).Howdoestheabovediscussionapplytopercentages?Rememberthataproportionisanaverageof0‐1dataandanyproportioncanbeconsideredtobebasedon0‐1data(proportionof1s).Sotheabovediscussionapplies.Oneothertechnicalitemneedstobere‐introducedhere.Apopulationof0sand1shasanSDthatcanbecomputedfromashort‐cutformulawhichdependsonlyontheproportionof1s,p,inthepopulation.
€
SD0−1population =
€
p(1− p)) andsotheSDofasamplemean(insamplingwithreplacement,asismostusual)is
€
SDsample.proportion. ˆ p =
€
p(1− p))n
Notethat
€
ˆ p isanestimatebasedontheproportionof1sinthesample,butpistheactual(andusuallyunknown)proportionof1sinthepopulation.Whenwewant
€
SDsample.proportion. ˆ p anddonotknowthetruepopulationpweusuallypluginour
estimate
€
ˆ p soweestimatethe
€
SDsample.proportion. ˆ p by
€
ˆ p (1− ˆ p ))n
.
Sohereisanexample:Supposeapopulationof10,000,000peopleissampledwithasampleofsize1000.Since1000/10,000,000islessthan1/20weignorethecorrectionfactor.Inoursampleof1000wefind4231sand5770s.Soourestimateofpis
€
ˆ p =423/1000=.423andtheaccuracyofthisestimateisindicatedbythe
€
SDsample.proportion. ˆ p =
€
ˆ p (1− ˆ p ))n
=
€
.423(.577))1000
=.016
Wecanreportthisinvariousequivalentways:
1) proportionof1sinthepopulationisestimatedas.423withSD=.0162) percentageof1sinthepopulationisestimatedas42.3%withSD1.6%3) percentageof1sinthepopulationisestimatedasin(39.1%,45.5%)19times
outof20.4) A95%ConfidenceIntervalforthepopulationpercentageof1’sis
(39.1%,45.5%)Q:Whenweplugin
€
ˆ p forpintheSDformulafor
€
ˆ p ,wechangeanexactformulatoanapproximateformula.Isthereamoreconservativeprocedure–whatisthelargestpossiblevalueof
€
p(1− p)) ?Thelargestpossiblevalueof
€
p(1− p)) is½,thevaluewhenp=1/2soaconservative
estimateof
€
SDsample.proportion. ˆ p =
€
12
n.
Q:Usingtheconservativeformula,howbigdoesasamplehavetobe(inabigpopulation)toestimatethesampleproportiontowithin3percentagepoints?Wewantthewidthofthe95%ConfidenceIntervalforaproportiontobe.03
percentagepoints,so2times
€
12
nmustbe.03.Thus
€
n mustbe
€
1.03
andnmustbe
atleast1111.So?
Politicalpollsareusuallybasedonafewthousandpeople.Ifyousay“whymorethan1111?”,theansweristhatreal‐worldpoliticalpollsarenotsimplerandomsamplesbutmorecomplicateddesigns,andtogetthesameinformation,thesemorecomplicatedsampleshavetobebiggerthanasimplerandomsample.Buttheadvantageofrandomsamplingisstillanimportantpartofthesemorecomplexdesigns.Q:DidweassumeNormalityintheabovediscussion?(Yes.Thefactor“2”fora95%ConfidenceIntervalisbasedontheNormaldistribution,andthejustificationforitistheCentralLimittheorem,whichsaysthataverages(andproportions)haveapproxNdistributions,withtheapproxgettingbetterandbetterasthesamplesizeincreases.)Q:Whenasurveyrespondentisaskedtheirpoliticalpreference,doestherespondentworryabouttheconfidentialityoftheinformation,andwilltherespondentanswertruthfullywithoutregardtoconfidentiality?A:Yusually,Nnotalways.RandomizedResponseSurveysHowtogetreliableconfidentialinformationwithoutanonymity!Everyoneintheclassusesacointotoss,resultinginHorT.Eachstudentshouldkeeptheoutcomeofthetossprivate.Accordingtotheoutcomeanswereitherquestion1or2byraisingyourhandforananswer“Yes”whenaskedbytheprof.Q1.(tobeansweredifyourtosswasahead).Isyourcoinahead?Q2.(tobeansweredifyourtosswasatail).Areyouaregularuserofmarijuana(usuallyonceperweekormore)?Toraiseyourhandinthissituation,youneedtobeassuredthatyourpeerswillnotknowifyougotaHorT,andsoayescouldmeanthatyoudidgetaheadoritcouldmeanthatyouareansweringYestothemarijuanaquestion.Butnobodyexceptyouwillknowwhichquestionyouareasking.Ididthisin2002withtheSTAT100classofabout100students.ThenIhad60Yesresponses(60handsup).SinceabouthalfofthesewereQ1answerers,onlyabout10oftheother50wereansweringYestoQ2.Sothepercentageofregularmarijuanauserswasestimatedtobe20%(10outof50).Wastheconfidentialityofresponsereallyprotected?Considerthe60answeringYestowhicheverquestiontheyanswered.Thereare5yessestoQ1foreach1yesto
Q2.SoapersonansweringYeshadaprobabilityof5/6or83%tobeansweringtheinnocuousQ1.Q:IsRandomizedResponseTechniquejustaparlourgameoraseriousconfidentialitystrategy?Thegovernmentandmanyotherlargecompanieshavehugedatabasisthattheymaywishtomakeavailabletolegitimateusers,buttheydonotwanttocompromisetheconfidentialityoftheinformationbyprovidingenoughinfoaboutindividualsthatsomeindividualsareidentifiedbytheirdataevenifnamesandaddressesareremoved.Onestrategythatisusedisthatthedataprovidedismessedupwithrandomerrorssothatindividualfilesaremodifiedbuttheaggregateofthefileshasthestatisticalinformationwantedbytheuser.Sovariationsothetechniqueareactuallyused.SamplingStudies Thebasicideainsamplingisthatitisaloteasierandcheapertobaseinformationaboutapopulationonasamplefromthepopulation,ratherthanmeasuretheentirepopulation.And,ifthesampleisarandomsample,oratleastusingadesignbasedonrandomization,thenthesampleitselfprovidesinformationabouttheaccuracyofthesampleasarepresentativeofthepopulation.Aswehaveseenwithpoliticalpolls,whenthepopulationishuge,samplingisaparticularlyefficientwayofgettingpopulationinformation. However,inmorecomplexsituations,samplingstudiescanbeahugechallenge,andareonlyattemptedwhenthegoalisveryworthwhile.Thetworeadingsdescribedbelowdescribethreesamplingstudiesthatsoughtimportantinformation,andinspiteoftheexpertiseinthedesignofthestudies,manypracticalproblemswereencountered.Thediscussionofthegoals,andtheeffortstoovercometheobstacles,isinstructiveforanyonetryingtojudgethevalidityoffindingsfromsuchstudies.Hill:EvaluatingSchoolChoicePrograms(pp6987)Aquestionmanypeoplewouldliketoknowtheanswertois:Areprivateschoolsbetterthanpublicschools?Butthisistoobigaquestion.Onethatmighthaveanansweris“AreprivateschoolsbetteratteachingthethreeRsthanpublicschools”?TheHillarticledescribesaseriousefforttofindananswertothisquestion.Thearticlefirstdiscusseswhythestudiesthathavebeendonesofardonotanswerthequestionsatisfactorily–mainlybecausetheyareobservationalstudies.Oftenexperimentswithhumansubjectsareunethicalandcannotgetapproval.Butthearticledescribesasituationinwhicha“natural”experimentispossiblemeaningthatitispossibletorandomlyassigntheproposed“cause”(privateschoolvspublicschool)totheexperimentalunits(studentsacceptedintothescholarshipprogram)
sothattheotherpossibleinfluenceswouldbebalancedbetweenthetwocomparisongroups(thatis,thegroupofstudentsassignedtotheprivateschoolandthegroupofstudentsassignedtothepublicschool.)Buttherewerepracticalproblemsthatmadethecomparisonabitambiguouseventhoughthedesignofthestudywassetupasan“experiment”.Thereweremissingdata:althoughthestudentsweretobetestedafterthreeyearsintheassignedschoolsystem,notallofthemshowedupforthetesting.Anotherproblemwasnoncompliance:notallstudentswhowereassignedtoonetypeofschoolacceptedthatassignment.Finally,therewerethesubjectivechoicesmademytheevaluatorswhencertainvariablesweredefinedfortheanalysis.Thearticlediscussesthewaysinwhichtheinvestigatorstriedtolessentheimpactofthesedeviationsfromapurelyobjectiveexperiment,andtheynodoubtwereabletolessensomeofthepotentialproblems,butasusualthepracticalproblemsinstudiesofhumansubjectsalwaysleavethefinalinferencetobetosomeextentsubjective.Insofarastheproblemscanbeignored,thefinalresultwasthatthepublicschoolsystemdidseemtohelpthechildrenfromlow‐incomefamilies,butonlytheonesfromAfrican‐Americanfamilies.Thisconclusionleftlotsofroomforfurtherstudies!Heiligetal“LeveragingChanceinHIVResearch”(pp227242)IntheHillstudyonschooltypes,themainproblemwastogetthehumansubjectstodowhattheywereassignedtodo.Butthelistofexperimentalunitswasreadilyavailable.InthisHIVstudythemainproblemwastogetalistofthestudysubjects,namelytheyoung“menwhohavesexwithothermen”inonesurveyand“childrenborntomotherswithHIV”intheothersurvey.IntheYoungMen’sSurvey(YMS),alistofmeetingvenuesforhomosexualmenwasconstructedandarandomsubsetofthesevisitedforinformationcollection.Alleligiblesubjectsattheselectedvenueswereaskedtoprovideinformationforthesurvey.Thetimesofthevisitswerealsorandomized.Theserandomizationstrategiesallowedforscale‐upoftheinformationtotheentirepopulationofinterest.Forparticipantsagreeingtoparticipate,bloodtestingwasdonetodeterminethepresenceofHIV.Table1providessomeresults(p233).Usingtheoddsratio,itwasdeterminedthatblackmeninthisstudypopulationhadasignificantlyhigherriskofHIVthanwhitemeninthesamepopulation.[Noteonterminology:Incidence:numberofnewcasesperyearper1000populationPrevalence:numberofexistingcasesperyearper1000population]IntheHIVNETClinicalTrialinUganda,theaimwastocomparetheeffectivenessoftwodrugs,ZDVandNVP,inpreventionofHIVtransmissionfrommothertoinfant.
Thedesignofaclassicalclinicaltrialincludesthefollowingelements:doubleblindtreatments,includingtheuseofaplacebo,andrandomizationoftreatmentstocases.(Youneedtounderstandthesetermsandwhytheyareimportant.)However,theUgandatrialdidnotusethedouble‐blindstrategy.TheThailandandUS/Francetrialsdidusethedoubleblindstrategy.TheresultsofthethreestudiesaresummarizedinTable3onpage240.Thethreetrialshavecomparableresults:TheNVPdrugissuperiortotheZDVinthiscontext.Onpage239,mentionismadeofamethodof“survivalanalysis”for“censored”data.WewillexplorethistechniqueinWeek9ofthecourse.Reminder:BesuretoaccessAssignment#7fromthecoursewebpage.