week 7 olympics: does medal count depend on population...

STAT100 Notes March2,2010 Week7 Olympics:Doesmedalcountdependonpopulationsize? Sampling:PoliticalPolls.PopulationsandSamples RandomizedResponseTechnique Hill:EvaluatingSchoolChoicePrograms Heiligetal:LeveragingChanceinHIVResearch Assignment#7:SampleMidtermII,DueMarch9,4pm.(Isweb‐postednow.)Olympics: Q:Isitpossibletocompareacountry’ssuccessinwinningOlympicmedalsinawaythatallowsforthecountry’spopulationsize?(Rprogram:medals.demo()) country medals.s pop gold medals.sw 1 usa 37 309 9 147 2 ger 30 82 10 71 3 nor 23 5 9 32 4 can 26 34 14 44 5 rus 15 142 3 87 6 kor 14 75 6 51 7 aus 16 8 4 19 8 fra 11 65 2 52 9 swi 9 8 6 16 10 chi 11 1336 5 111 11 swe 11 9 5 16 12 net 8 17 4 24 13 cze 6 11 2 12 14 pol 6 38 1 16 15 ita 5 60 1 32 16 ast 3 22 2 49 17 slk 3 5 1 9 18 jap 5 127 0 31 19 lat 2 2 0 5 20 bel 3 10 1 20 21 cro 3 4 0 8 22 sln 3 2 0 8 23 gbr 1 62 1 48 24 est 1 1 0 3 25 fin 5 5 0 9 26 kaz 1 16 0 14

ButChinaissopopulouscomparedtotheothercountriesinthewinterOlympicsthattherelationshipofmedalstopopulationissquashed.WeputasideChinaforthisfurtherinvestigationoftherelationship.Theresultisshownbelow.

0 200 400 600 800 1000 1200

05

10

15

20

25

30

35

Winter 2010 Olympics

Population

Num

ber

of M

edals

usa

ger

nor

can

rus

kor

aus

fra

swi

chiswe

net

czepolita

astslk

jap

latbelcrosln

gbrest

fin

kaz

Therelationshipisstillabitobscuredbytoomanysmallcountriesinthelowerleftcornerofthegraph.Let’sstretchitoutbytakingthesquarerootofpopulation.…

0 50 100 150 200 250 300

020

40

60

80

100

120

140

Winter 2010 & Summer 2008 Olympics

Population

Num

ber

of M

edals

usa

ger

nor

can

rus

kor

aus

fra

swiswe

net

cze

pol

ita

ast

slk

jap

lat

bel

crosln

gbr

est

fin

kaz

Nowwehaveaclearerrelationship.Butisitlegaltouse√Population?Asimpler(butequivalent)waytoshowtherelationshipistoallowtherelationshiptobecurved.Theidenticalfitcanthenbeshownwithoutthesquareroottrick.

5 10 15

020

40

60

80

100

120

140


!Population

Num

ber

of M

edals

usa

ger

nor

can

rus

kor

aus

fra

swiswe

net

cze

pol

ita

ast

slk

jap

lat

bel

crosln

gbr

est

fin

kaz

Whatuseisthisgraph?Clearlythereareimportantdifferencesthatweknowabout,suchaspercapitaGNP,thatwouldallowsomecountriestofundamoremedal‐winningteam.Butatleastwecanseetheeffectofpopulationbaseandadjustforthatincomparingcountries.Theresidualsintheabovegraphareameasureoftheextentthatacountry’smedalcountisnotexplainedbypopulation.AndforcountrieslikeNorway,Canada,andSlovenia,andothers,thisisausefulthing!Sampling:PoliticalPollsJustpriortotheFall2008election,thefollowinggraphwaspublishedonthecoverofthemagazineLiaison:

0 50 100 150 200 250 300

020

40

60

80

100

120

140


Population

Num

ber

of M

edals

usa

ger

nor

can

rus

kor

aus

fra

swiswe

net

cze

pol

ita

ast

slk

jap

lat

bel

crosln

gbr

est

fin

kaz

ThegraphwasproducedbyAndrewHeardofthePoliticalScienceDeptatSFU.Theheavylinesaresmoothesofthefinerlines,usingthemovingaveragetechnique.ThefinerlinesjoinuppointsfromsurveysdonebyvariousorganizationsinCanada.WeknowtheresultwasaminoritygovernmentoftheConservativeparty,whichwouldbenosurprisetosomeonelookingatthischartbeforetheelection.(Canada’sfirst‐past‐the‐postelectionresultsystemtendstoamplifysmalldifferencesinthepopularvote,butnotenoughinthisinstancetogivetheConservativesamajority.)Hereisanupdatedversionoftheabovegraph,alsosuppliedbySFUsAndrewHeard:

Hereisthequestionmostimportantforourpurposes:Q:Howcananopinionpollbasedonafewthousandresponsesbeareliableindicationoftheaggregateopinionsofmillions?Nationalpoliticalpollstendtosampleabout2,000peoplewhilethepopulationofCanadaisabout35,000,000.Ifthesesamplesurveysproducereliableinformation,theyarequiteusefulandevenamazing,right?Hereisthekeyformulathatexplainsthisamazingresult:

€

SDsample.mean =

€

SDpopulation

n.

€

N − nN −1

HereNisthepopulationsize,andnisthesamplesize.Exceptforthelastsquareroot,thisformulashouldlookfamiliar.Rememberthataverages(=means)varylessthatthethingsthatareaveraged,andthereductioninvariabilityisafactor

€

n .Weneedtoexplainthesuddenadditionofthatlastterm

€

N − nN −1

.

Whenwetalkaboutsamplingfromanormalprobabilitydistribution,orfromanyprobabilitydistribution,weusuallyimplicitlyassumethattheremovalofanitemfromthepopulation(drawnintothesample)doesnotchangethemakeupofthepopulation–justasifthepopulationwereinfinite,or,ifweweresamplingwith

replacement.Butinpollsfromrealpopulations(asopposedtoabstractonesdefinedbyaprobabilitydistribution)wemightneedtoconsiderthatwearereallysamplingwithoutreplacement.Q:Ifwechooseasamplewithoutreplacementfromareal(andfinite)population,andwedecidetosamplethewholepopulation,howvariablewouldthesamplemeanbeinsuchaprocess?Commonsenseanswersthisquestion,butitiscomfortingthattheaboveformulaalsoprovidesthesameanswer!Now,inpoliticalpollsliketheonesmentionedintheintroabove,thesamplesizenismuchsmallerthanthepopulationsizeN.Inthiscase,whathappenstothefactor

€

N − nN −1

?Clearly,itwillbeverycloseto1,andmultiplyingitbythemorefamiliar

partoftheformulawillnotchangeanything.(rewriting

€

N − nN −1

as

€

1− n −1N −1

makesthisevenmoreclear).Aruleofthumbis,whenn/Nislessthan1/20,ignorethefactor.Thefactoriscommonlycalled“thefinitepopulationcorrectionfactor”butperhapsamoredescriptivenameforitwouldbe“thecorrectionfactorforsamplingwithoutreplacement”.Thetwonamesareequivalent–asabitofreflectionreveals.Oneimportantconclusionfromthediscussionisthat,inpoliticalpolling,onecan

ignorethefactor

€

N − nN −1

,andsooneisleftwiththeusualsquarerootrelationship

betweenthepopulationSDandtheSDofthesamplemean.Q:Politicalpollsusuallyendupwithresultsexpressedintermofproportionsofpercentages,whereasmeansareusuallycomputedforquantitativedata(likeincome,importsorGNP).Howdoestheabovediscussionapplytopercentages?Rememberthataproportionisanaverageof0‐1dataandanyproportioncanbeconsideredtobebasedon0‐1data(proportionof1s).Sotheabovediscussionapplies.Oneothertechnicalitemneedstobere‐introducedhere.Apopulationof0sand1shasanSDthatcanbecomputedfromashort‐cutformulawhichdependsonlyontheproportionof1s,p,inthepopulation.

€

SD0−1population =

€

p(1− p)) andsotheSDofasamplemean(insamplingwithreplacement,asismostusual)is

€

SDsample.proportion. ˆ p =

€

p(1− p))n

Notethat

€

ˆ p isanestimatebasedontheproportionof1sinthesample,butpistheactual(andusuallyunknown)proportionof1sinthepopulation.Whenwewant

€

SDsample.proportion. ˆ p anddonotknowthetruepopulationpweusuallypluginour

estimate

€

ˆ p soweestimatethe

€

SDsample.proportion. ˆ p by

€

ˆ p (1− ˆ p ))n

.

Sohereisanexample:Supposeapopulationof10,000,000peopleissampledwithasampleofsize1000.Since1000/10,000,000islessthan1/20weignorethecorrectionfactor.Inoursampleof1000wefind4231sand5770s.Soourestimateofpis

€

ˆ p =423/1000=.423andtheaccuracyofthisestimateisindicatedbythe

€


€

ˆ p (1− ˆ p ))n

=

€

.423(.577))1000

=.016

Wecanreportthisinvariousequivalentways:

1) proportionof1sinthepopulationisestimatedas.423withSD=.0162) percentageof1sinthepopulationisestimatedas42.3%withSD1.6%3) percentageof1sinthepopulationisestimatedasin(39.1%,45.5%)19times

outof20.4) A95%ConfidenceIntervalforthepopulationpercentageof1’sis

(39.1%,45.5%)Q:Whenweplugin

€

ˆ p forpintheSDformulafor

€

ˆ p ,wechangeanexactformulatoanapproximateformula.Isthereamoreconservativeprocedure–whatisthelargestpossiblevalueof

€

p(1− p)) ?Thelargestpossiblevalueof

€

p(1− p)) is½,thevaluewhenp=1/2soaconservative

estimateof

€


€

12

n.

Q:Usingtheconservativeformula,howbigdoesasamplehavetobe(inabigpopulation)toestimatethesampleproportiontowithin3percentagepoints?Wewantthewidthofthe95%ConfidenceIntervalforaproportiontobe.03

percentagepoints,so2times

€

12

nmustbe.03.Thus

€

n mustbe

€

1.03

andnmustbe

atleast1111.So?

Politicalpollsareusuallybasedonafewthousandpeople.Ifyousay“whymorethan1111?”,theansweristhatreal‐worldpoliticalpollsarenotsimplerandomsamplesbutmorecomplicateddesigns,andtogetthesameinformation,thesemorecomplicatedsampleshavetobebiggerthanasimplerandomsample.Buttheadvantageofrandomsamplingisstillanimportantpartofthesemorecomplexdesigns.Q:DidweassumeNormalityintheabovediscussion?(Yes.Thefactor“2”fora95%ConfidenceIntervalisbasedontheNormaldistribution,andthejustificationforitistheCentralLimittheorem,whichsaysthataverages(andproportions)haveapproxNdistributions,withtheapproxgettingbetterandbetterasthesamplesizeincreases.)Q:Whenasurveyrespondentisaskedtheirpoliticalpreference,doestherespondentworryabouttheconfidentialityoftheinformation,andwilltherespondentanswertruthfullywithoutregardtoconfidentiality?A:Yusually,Nnotalways.RandomizedResponseSurveysHowtogetreliableconfidentialinformationwithoutanonymity!Everyoneintheclassusesacointotoss,resultinginHorT.Eachstudentshouldkeeptheoutcomeofthetossprivate.Accordingtotheoutcomeanswereitherquestion1or2byraisingyourhandforananswer“Yes”whenaskedbytheprof.Q1.(tobeansweredifyourtosswasahead).Isyourcoinahead?Q2.(tobeansweredifyourtosswasatail).Areyouaregularuserofmarijuana(usuallyonceperweekormore)?Toraiseyourhandinthissituation,youneedtobeassuredthatyourpeerswillnotknowifyougotaHorT,andsoayescouldmeanthatyoudidgetaheadoritcouldmeanthatyouareansweringYestothemarijuanaquestion.Butnobodyexceptyouwillknowwhichquestionyouareasking.Ididthisin2002withtheSTAT100classofabout100students.ThenIhad60Yesresponses(60handsup).SinceabouthalfofthesewereQ1answerers,onlyabout10oftheother50wereansweringYestoQ2.Sothepercentageofregularmarijuanauserswasestimatedtobe20%(10outof50).Wastheconfidentialityofresponsereallyprotected?Considerthe60answeringYestowhicheverquestiontheyanswered.Thereare5yessestoQ1foreach1yesto

Q2.SoapersonansweringYeshadaprobabilityof5/6or83%tobeansweringtheinnocuousQ1.Q:IsRandomizedResponseTechniquejustaparlourgameoraseriousconfidentialitystrategy?Thegovernmentandmanyotherlargecompanieshavehugedatabasisthattheymaywishtomakeavailabletolegitimateusers,buttheydonotwanttocompromisetheconfidentialityoftheinformationbyprovidingenoughinfoaboutindividualsthatsomeindividualsareidentifiedbytheirdataevenifnamesandaddressesareremoved.Onestrategythatisusedisthatthedataprovidedismessedupwithrandomerrorssothatindividualfilesaremodifiedbuttheaggregateofthefileshasthestatisticalinformationwantedbytheuser.Sovariationsothetechniqueareactuallyused.SamplingStudies Thebasicideainsamplingisthatitisaloteasierandcheapertobaseinformationaboutapopulationonasamplefromthepopulation,ratherthanmeasuretheentirepopulation.And,ifthesampleisarandomsample,oratleastusingadesignbasedonrandomization,thenthesampleitselfprovidesinformationabouttheaccuracyofthesampleasarepresentativeofthepopulation.Aswehaveseenwithpoliticalpolls,whenthepopulationishuge,samplingisaparticularlyefficientwayofgettingpopulationinformation. However,inmorecomplexsituations,samplingstudiescanbeahugechallenge,andareonlyattemptedwhenthegoalisveryworthwhile.Thetworeadingsdescribedbelowdescribethreesamplingstudiesthatsoughtimportantinformation,andinspiteoftheexpertiseinthedesignofthestudies,manypracticalproblemswereencountered.Thediscussionofthegoals,andtheeffortstoovercometheobstacles,isinstructiveforanyonetryingtojudgethevalidityoffindingsfromsuchstudies.Hill:EvaluatingSchoolChoicePrograms(pp6987)Aquestionmanypeoplewouldliketoknowtheanswertois:Areprivateschoolsbetterthanpublicschools?Butthisistoobigaquestion.Onethatmighthaveanansweris“AreprivateschoolsbetteratteachingthethreeRsthanpublicschools”?TheHillarticledescribesaseriousefforttofindananswertothisquestion.Thearticlefirstdiscusseswhythestudiesthathavebeendonesofardonotanswerthequestionsatisfactorily–mainlybecausetheyareobservationalstudies.Oftenexperimentswithhumansubjectsareunethicalandcannotgetapproval.Butthearticledescribesasituationinwhicha“natural”experimentispossiblemeaningthatitispossibletorandomlyassigntheproposed“cause”(privateschoolvspublicschool)totheexperimentalunits(studentsacceptedintothescholarshipprogram)

sothattheotherpossibleinfluenceswouldbebalancedbetweenthetwocomparisongroups(thatis,thegroupofstudentsassignedtotheprivateschoolandthegroupofstudentsassignedtothepublicschool.)Buttherewerepracticalproblemsthatmadethecomparisonabitambiguouseventhoughthedesignofthestudywassetupasan“experiment”.Thereweremissingdata:althoughthestudentsweretobetestedafterthreeyearsintheassignedschoolsystem,notallofthemshowedupforthetesting.Anotherproblemwasnoncompliance:notallstudentswhowereassignedtoonetypeofschoolacceptedthatassignment.Finally,therewerethesubjectivechoicesmademytheevaluatorswhencertainvariablesweredefinedfortheanalysis.Thearticlediscussesthewaysinwhichtheinvestigatorstriedtolessentheimpactofthesedeviationsfromapurelyobjectiveexperiment,andtheynodoubtwereabletolessensomeofthepotentialproblems,butasusualthepracticalproblemsinstudiesofhumansubjectsalwaysleavethefinalinferencetobetosomeextentsubjective.Insofarastheproblemscanbeignored,thefinalresultwasthatthepublicschoolsystemdidseemtohelpthechildrenfromlow‐incomefamilies,butonlytheonesfromAfrican‐Americanfamilies.Thisconclusionleftlotsofroomforfurtherstudies!Heiligetal“LeveragingChanceinHIVResearch”(pp227242)IntheHillstudyonschooltypes,themainproblemwastogetthehumansubjectstodowhattheywereassignedtodo.Butthelistofexperimentalunitswasreadilyavailable.InthisHIVstudythemainproblemwastogetalistofthestudysubjects,namelytheyoung“menwhohavesexwithothermen”inonesurveyand“childrenborntomotherswithHIV”intheothersurvey.IntheYoungMen’sSurvey(YMS),alistofmeetingvenuesforhomosexualmenwasconstructedandarandomsubsetofthesevisitedforinformationcollection.Alleligiblesubjectsattheselectedvenueswereaskedtoprovideinformationforthesurvey.Thetimesofthevisitswerealsorandomized.Theserandomizationstrategiesallowedforscale‐upoftheinformationtotheentirepopulationofinterest.Forparticipantsagreeingtoparticipate,bloodtestingwasdonetodeterminethepresenceofHIV.Table1providessomeresults(p233).Usingtheoddsratio,itwasdeterminedthatblackmeninthisstudypopulationhadasignificantlyhigherriskofHIVthanwhitemeninthesamepopulation.[Noteonterminology:Incidence:numberofnewcasesperyearper1000populationPrevalence:numberofexistingcasesperyearper1000population]IntheHIVNETClinicalTrialinUganda,theaimwastocomparetheeffectivenessoftwodrugs,ZDVandNVP,inpreventionofHIVtransmissionfrommothertoinfant.

Thedesignofaclassicalclinicaltrialincludesthefollowingelements:doubleblindtreatments,includingtheuseofaplacebo,andrandomizationoftreatmentstocases.(Youneedtounderstandthesetermsandwhytheyareimportant.)However,theUgandatrialdidnotusethedouble‐blindstrategy.TheThailandandUS/Francetrialsdidusethedoubleblindstrategy.TheresultsofthethreestudiesaresummarizedinTable3onpage240.Thethreetrialshavecomparableresults:TheNVPdrugissuperiortotheZDVinthiscontext.Onpage239,mentionismadeofamethodof“survivalanalysis”for“censored”data.WewillexplorethistechniqueinWeek9ofthecourse.Reminder:BesuretoaccessAssignment#7fromthecoursewebpage.

week 7 olympics: does medal count depend on population...

Documents