biostatistics workbook aug07-1
Biostatistics Workbook
Field Epidemiology and Lab Training Programs (FELTP)
DRAFT
Department of Health and Human Services
Centers for Disease Control and Prevention
Coordinating Office for Global Health
Office of Capacity Development and Program Coordination
Division of Epidemiology and Surveillance Capacity Development
-
Acknowledgements:
We thank the following for their time and efforts in developing the content of this workbook:
Donna Jones, Michael A. Joseph, Jennifer Scharff, Nadine Sunderland
Content Review:
Edmond Maes, Peter Nsubuga
-
Biostatistics Workbook, DRAFT: Aug. 28, 2007
Table of Contents
How to Use this Workbook
Introduction to Biostatistics
Scales of Measurement
Frequency Distributions
Central Location and Dispersion
    Measures of Central Tendency
    Measures of Dispersion
Probability and the Normal Distribution
    Probability Distribution
    Normal Distribution
    Central Limit Theorem
Statistical Inference
    Confidence Interval Around a Mean
    Confidence Interval Around a Proportion
    Hypothesis Testing: Two Sample t test
    Confidence Interval Estimation: Two Sample t test
    Hypothesis Testing: z test for Difference in Proportions
    Confidence Interval Estimation: z test for Difference in Proportions
    Hypothesis Testing: Paired t test
    Confidence Interval Estimation: Paired t test
    Fisher's Exact Test
    Chi Square Test for Independence
Confidence Intervals for Case Control and Cohort Studies
    Confidence Intervals: Odds Ratios and Relative Risks
Sample Size
    Sample Size for Descriptive Studies
    Sample Size for Analytic Studies
Correlation and Regression Analysis
    Pearson Product Moment Correlation Coefficient
    Simple Linear Regression
    One Way Analysis of Variance (ANOVA)
References
Appendix 1: Answer Key
Appendix 2: Distribution Tables
    Student's t Table
    Standard Normal z
    Chi Square Distribution
    F Distribution
How to Use this Workbook
This workbook is intended as a resource for students in introductory biostatistics courses. It provides students with step-by-step guidance through example problems calculated by hand and with readily available statistical software programs. Practice problems are given, along with an answer key, so that students are able to solidify what they have learned in their biostatistics courses.
The workbook may also be used as a reference once a student has completed a biostatistics course. Though it does not provide detailed information on the theory of biostatistical concepts, it will serve as a refresher as to what statistical test should be used in a given situation and how to do the calculations that accompany that test.
Introduction to Biostatistics
This workbook provides an overview of basic biostatistics topics including scales of measurement, central location and dispersion, normal distribution, tests of statistical inference, sample size, and correlation and regression analysis. Following each description are examples and practice problems to be completed both by hand and with the aid of a statistical computer program. These examples and practice problems will give you an opportunity to apply the concepts to situations that you may find in the field. Data sets for the practice problems are either included in the workbook or on the accompanying CD. As you complete the practice problems, you may check your work by referring to the answer key located in Appendix 1.
This workbook is meant as a supplemental text and is not intended to replace your regular biostatistics course. However, we all need a friendly reminder from time to time. For this reason, we have included definitions of commonly used terms in biostatistics for your reference.
Data: The raw material of statistics. Data generally consist of measurements or counts taken on a population or sample. For example, a nurse may record the temperature of patients (a measurement) or count the number of patients with a temperature above normal (a count).
Variable: The term for a characteristic that differs among members of a population or sample, such as height. Because the measurement is not constant, it is variable. Variables can be qualitative or quantitative, continuous or discrete. The values of a random variable cannot be predicted exactly in advance; random variables are the most useful for statistical purposes.
Population: A collection of entities. A statistical population refers to the largest collection of entities in which we have an interest. For example, we may be interested in looking at women of reproductive age who have had one child. Therefore, our population is limited to only those women aged 15 to 45 who have one child.
Sample: Part of a population. A sample of the example population of women aged 15 to 45 with one child might consist of an estimated 25 percent of the population.
Parameter: A descriptive measure computed from the data of a population.
Statistic: A descriptive measure computed from the data of a sample. Statistics is a field which examines the collection, organization, summarization, and analysis of data and draws inferences about a population through observation of a sample.
Descriptive Statistics: Methods for presenting and summarizing data. Descriptive statistics allow us to understand general patterns in a large quantity of data without conducting a formal test of a hypothesis.
Inferential Statistics: Statistics used to reach a conclusion about a population based on information gathered from a sample of that population. Involves estimation or hypothesis testing.
Statistical Symbols
μ: population mean
σ: population standard deviation
x̄: sample mean
s: sample standard deviation
x.50: median (the 50th percentile)
Scales of Measurement
Overview
Scales of measurement allow you to categorize data in order to provide information about the characteristic being measured.
The type of scale used in measuring data affects the type and amount of information that can be obtained. This affects how data will be treated statistically.
Recognizing the different scales of measurement and understanding their implications for analyzing data will also assist you in creating questionnaires for epidemiologic studies.
There are four commonly recognized scales of measurement for variables.
Nominal Scale
The nominal scale classifies persons or things based on a qualitative assessment of the characteristic being assessed. It includes no information on quantity or amount, and it does not indicate "more than" or "less than."
Example: Gender (male or female) is a common nominal variable used in epidemiologic studies.
Example: Country telephone codes are an example of numeric variables that do not indicate more or less (country code 82 is not more than country code 37).
Ordinal Scale
The ordinal scale also classifies persons or things based on the characteristic being assessed but does indicate "more than" or "less than." In this sense, it provides more information than the nominal scale. However, the ordinal scale does not indicate how much more than or less than.
Example: Rating students' performance as being poor, average, good, or excellent indicates how well students perform and provides a basis for comparison. However, it does not indicate how much better an excellent performance is compared to a good one.
Interval Scale
The interval scale has the same characteristics as the ordinal scale (classifying persons or things based on the characteristic assessed and indicating "more than" or "less than"), but the interval scale also indicates how much more than or less than. What the interval scale does not do is indicate a true zero point: a value of zero does not mean an absence of the characteristic being measured. Additionally, ratios made with two numbers in the interval scale do not have meaning.
Example: Temperature (measured in degrees Celsius or Fahrenheit) is an interval variable in that different values can tell you how much more or less. However, there is no true zero point. The value of zero in temperature does not indicate absence of temperature. Also, when comparing two temperatures, their ratio is not meaningful. We would not say that a 90 degree temperature is twice as hot as a 45 degree temperature.
Ratio Scale
The ratio scale includes all the characteristics of the interval scale but does indicate a true zero point.
Example: Height and weight measurements indicate how much more or less, but also have a true zero point. A weight of zero indicates an absence of weight.
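One way to see why ratios are meaningless on an interval scale but meaningful on a ratio scale is to change the unit of measurement. The sketch below is illustrative only: it assumes the 90 and 45 degree temperatures are in Fahrenheit (the workbook does not state the unit) and uses a hypothetical pair of weights, 90 lb and 45 lb. The function names are made up for this example.

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def pounds_to_kilograms(pounds: float) -> float:
    """Convert a weight from pounds to kilograms."""
    return pounds * 0.45359237


# Interval scale: the ratio changes when the unit changes,
# because zero degrees is an arbitrary point, not "no temperature".
ratio_fahrenheit = 90.0 / 45.0
ratio_celsius = fahrenheit_to_celsius(90.0) / fahrenheit_to_celsius(45.0)

# Ratio scale: the ratio is the same in any unit,
# because zero weight really means "no weight".
ratio_pounds = 90.0 / 45.0
ratio_kilograms = pounds_to_kilograms(90.0) / pounds_to_kilograms(45.0)

print(f"Temperature ratio in Fahrenheit: {ratio_fahrenheit:.2f}")
print(f"Temperature ratio in Celsius:    {ratio_celsius:.2f}")
print(f"Weight ratio in pounds:          {ratio_pounds:.2f}")
print(f"Weight ratio in kilograms:       {ratio_kilograms:.2f}")
```

The two temperature ratios disagree, so "twice as hot" depends on the unit chosen, while the weight ratio is 2 in either unit.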
Scales of Measurement: SUMMARY
Nominal: Classifies persons or things based on a qualitative assessment. Similar or dissimilar, but not more or less. Can be numeric, but there is no implication of more or less.
Ordinal: Classifies persons or things based on a qualitative assessment. More or less, but not how much more or less.
Interval: Indicates how much more or less. Does not contain a true zero point. Cannot create meaningful ratios of two numbers.
Ratio: Includes all the characteristics of the interval scale, but contains a true zero point.
Practice: Scales of Measurement
Identify the scale described in each situation below:
1. Temperature of patients at a health facility
2. The weight of children under five at a weekly baby weighing
3. The religion of families in a village
4. The length of time spent in the hospital
5. The diagnosis of patients upon admission to the hospital
Related Concepts: Frequency Distribution
Frequency Distributions
Overview
Frequency distributions show how often each value for a variable occurs in a sample or population.
Example: Malaria cases may be reported as a frequency by month in order to determine the high-risk months of the year.
One of the most common ways to summarize data for better understanding and clearer presentation is through a frequency distribution. A frequency distribution is a presentation of the number of times (or the frequency) that each value (or group of values) occurs in the study population.
A frequency distribution helps to give a picture of the shape of the distribution of the data. Data is unimodal if it has only one peak, bimodal if it has two peaks, and multimodal if there are more than two peaks. Measures of dispersion will help you to form a clearer picture of the distribution of the data by describing the width, or the spread, of the data. We will discuss this in more detail in the section titled Measures of Dispersion.
A frequency distribution can be displayed as a table, a bar chart, a histogram, or a frequency polygon. Each display should be clearly labeled with the frequency number. The method usually depends on the type of variable being described.
Categorical variables are qualitative in nature and are best displayed as a table or a bar chart.
Table
A frequency table simply shows the number of times each specific observation appears in a sample or population.
Cases of Malaria
Day         Frequency
Monday      6
Tuesday     4
Wednesday   2
Thursday    5
Friday      3
Saturday    4
Total       24
Bar chart
A bar chart, like a table, displays the number of observations for each category of a variable, but provides a better visual representation.
[Bar chart: Cases of Malaria. X axis: day of the week (Monday to Saturday). Y axis: Frequency (0 to 7).]
Numerical variables are quantitative in nature and are best displayed as a frequency histogram or a frequency polygon.
Frequency histogram
A frequency histogram shows the frequencies relative to each other. The width of each bar is proportional to the class interval that it represents. Typically there are no spaces between bars in a frequency histogram, though you may see histograms constructed with spaces at times.
[Histogram: Frequency of Malaria Cases in the Past Year. X axis: Number of Cases (0, 1, 2, 3, 3+). Y axis: People (0 to 25).]
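Frequency polygon
A frequency polygon encloses the same area under the line that a histogram displays within its bars. Each point represents the midpoint of a class interval. Though a frequency polygon may look like a line graph, a frequency polygon must be closed at the ends: the line returns to a frequency of zero on both sides of the data.
[Frequency polygon: Frequency of Malaria Cases in the Past Year. X axis: Number of Cases (0, 1, 2, 3, 3+), with a closing zero-frequency point at each end. Y axis: People (0 to 25).]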
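The idea of closing a frequency polygon at both ends can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class midpoints and counts are hypothetical, the classes are assumed to be equally wide, and the function name is made up for this example.

```python
def frequency_polygon_points(midpoints, counts):
    """Return the (x, y) points of a closed frequency polygon.

    Assumes equally spaced class midpoints. One extra point with a
    frequency of zero is added one class width before the first
    midpoint and one class width after the last midpoint.
    """
    if len(midpoints) != len(counts) or len(midpoints) < 2:
        raise ValueError("need at least two classes, one count per class")
    width = midpoints[1] - midpoints[0]
    points = [(midpoints[0] - width, 0)]       # closing point on the left
    points.extend(zip(midpoints, counts))      # one point per class midpoint
    points.append((midpoints[-1] + width, 0))  # closing point on the right
    return points


# Hypothetical class midpoints (for example, 5 kg weight classes) and counts.
midpoints = [47, 52, 57, 62]
counts = [3, 8, 6, 2]

points = frequency_polygon_points(midpoints, counts)
for x, y in points:
    print(x, y)
```

Plotting these points in order and joining them with straight lines gives a polygon that starts and ends on the horizontal axis.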
Numerical variables may need to be grouped for presentation if the number of values is large or the variable is continuous. The box below gives guidelines on how to group variables.
Relative Frequency
Often it is useful to know the proportion of the values that fall within a specific category or group. This is obtained by dividing the number of values in that category by the total number in the sample. This is referred to as the relative frequency and is presented as a proportion (values from 0.0 to 1.0) or a percent (values from 0% to 100%).
When reporting either the frequency or the relative frequency in table or graph form, make sure that all data is clearly labeled.
Cases of Malaria
Day         Frequency   Percent   Cum Percent
Monday      6           25.0      25.0
Tuesday     4           16.7      41.7
Wednesday   2           8.3       50.0
Thursday    5           20.8      70.8
Friday      3           12.5      83.3
Saturday    4           16.7      100.0
Total       24          100.0     100.0
In the table above, the relative frequency is presented as a percent of the whole.
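The percent and cumulative percent columns can be reproduced with a short calculation. The sketch below uses the malaria counts from the table; the function name and the layout of the result are choices made for this illustration, not part of the workbook.

```python
def relative_frequency_table(counts):
    """Return rows of (category, frequency, percent, cumulative percent).

    Percentages are rounded to one decimal place. The cumulative percent
    is computed from the running count, not by adding rounded percents,
    so rounding errors do not accumulate.
    """
    total = sum(counts.values())
    rows = []
    running_count = 0
    for category, frequency in counts.items():
        running_count += frequency
        percent = round(100 * frequency / total, 1)
        cumulative_percent = round(100 * running_count / total, 1)
        rows.append((category, frequency, percent, cumulative_percent))
    return rows


# Malaria cases by day, taken from the table above.
cases_by_day = {
    "Monday": 6,
    "Tuesday": 4,
    "Wednesday": 2,
    "Thursday": 5,
    "Friday": 3,
    "Saturday": 4,
}

rows = relative_frequency_table(cases_by_day)
for category, frequency, percent, cumulative_percent in rows:
    print(f"{category:<10} {frequency:>3} {percent:>6.1f} {cumulative_percent:>6.1f}")
```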
Grouping Variables
Continuous numeric variables must often be regrouped into categories for analysis purposes. Listed below are some general guidelines to use when grouping variables:
Create class intervals that are mutually exclusive and include all data. It should be clear where one interval stops and the next one begins. No value should fall into more than one interval.
Use a large number of narrow class intervals for the initial analysis. All intervals should be the same size. You can combine intervals later if needed, but it is impossible to break intervals down further without referring back to the original data.
Use natural or meaningful groupings when possible. There are many groupings, such as five-year age intervals and body mass index (BMI), which are used frequently and, therefore, have become standard. Some groupings have been established by organizations such as WHO or CDC.
Create a separate category for unknowns. This will avoid confusion when comparing subgroup observations (n) to the total number of observations (N).
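The guidelines above can be sketched as a small grouping routine. The example assumes whole-number ages, uses the ten ages from the Step-by-Step Example in this section, and adds one hypothetical missing value (None) to show the separate category for unknowns. The function name, the starting value of 25, and the label format are assumptions made for this illustration.

```python
from collections import Counter


def group_into_intervals(values, start, width):
    """Count whole-number values in equal-width, mutually exclusive intervals.

    Intervals are labeled "lower-upper", for example "25-29" when
    start=25 and width=5. Missing values (None) are counted in a
    separate "Unknown" category.
    """
    counts = Counter()
    for value in values:
        if value is None:
            counts["Unknown"] += 1
            continue
        lower = start + ((value - start) // width) * width
        counts[f"{lower}-{lower + width - 1}"] += 1
    return counts


# Ages from the Step-by-Step Example, plus one hypothetical unknown age.
ages = [32, 35, 28, 45, 47, 36, 29, 31, 42, 44, None]

age_groups = group_into_intervals(ages, start=25, width=5)
for label in sorted(age_groups):
    print(label, age_groups[label])
```

Because each value is assigned to exactly one interval and the unknown is kept apart, the subgroup counts (n) add up to the total number of observations (N).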
Step-by-Step Example: Frequency Distributions
Use the data below to create frequency distributions. This might represent a class of master's students. First, create a frequency table for Gender, then display the same information in a bar chart. Next, create a histogram of Number of children. Also, display this information in a frequency polygon.
Subject   Gender   Age   Number of children   Marital Status*
1         M        32    1                    M
2         M        35    0                    M
3         F        28    0                    S
4         M        45    3                    D
5         F        47    3                    M
6         F        36    2                    D
7         M        29    1                    S
8         M        31    0                    S
9         F        42    2                    D
10        F        44    2                    M
*M = married, S = single, D = divorced
1. Create a frequency table.
Determine the number of observations for each category under Gender. Display this in a table.
Gender   Frequency
Female   5
Male     5
2. Create a bar chart.
Display the frequency of the observations for Gender in a bar chart.
[Bar chart: Gender of Participants. X axis: Gender (Male, Female). Y axis: Frequency (0 to 6).]
3. Create a histogram.
Display the frequency of the observations for Number of children in a histogram.
[Histogram: Number of Children of Participants. X axis: Number of Children (0, 1, 2, 3). Y axis: frequency (0 to 3.5).]
4. Create a frequency polygon.
Display the frequency for Number of children as a polygon.
[Frequency polygon: Number of Children of Participants. X axis: Children (0, 1, 2, 3), with a closing zero-frequency point at each end. Y axis: frequency (0 to 3.5).]
5. Describe the data.
There are an equal number of men and women in the class. The frequency distribution shows that the variable Number of children is bimodal. The majority of participants (6 of 10) have either no children or two children.
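The counts in this example can be checked with Python's standard library. The sketch below copies the Gender and Number of children columns from the data table; the variable names and the text bar chart are choices made for this illustration.

```python
from collections import Counter

# Columns copied from the ten subjects in the example data table.
gender = ["M", "M", "F", "M", "F", "F", "M", "M", "F", "F"]
children = [1, 0, 0, 3, 3, 2, 1, 0, 2, 2]

gender_counts = Counter(gender)
children_counts = Counter(children)

# Frequency table for Gender.
for category in sorted(gender_counts):
    print(category, gender_counts[category])

# A simple text bar chart for Number of children.
for value in sorted(children_counts):
    print(value, "#" * children_counts[value])

# The peaks of the distribution are the values with the highest frequency.
highest = max(children_counts.values())
peaks = sorted(v for v, count in children_counts.items() if count == highest)
print("Peaks:", peaks)
```

Two values share the highest frequency, which matches the description of the variable as bimodal.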
Practice: Frequency Distributions
Using the following data set, create visual representations of the frequency distributions for the variables.
Subject   Gender   Age   Number of children   Marital Status
1         M        32    1                    M
2         M        35    0                    M
3         F        28    0                    S
4         M        45    3                    D
5         F        47    3                    M
6         F        36    2                    D
7         M        29    1                    S
8         M        31    0                    S
9         F        42    2                    D
10        F        44    2                    M
1. Create a frequency table for the variable Marital Status. (Include the cumulative percent.)
2. Show the same information in a bar chart.
3. Draw a frequency histogram for the variable Age. Group the ages in intervals of five before beginning.
4. Display the same information in a frequency polygon.
Space has been provided on the following pages to complete your work.
Practice Space
1. Create a frequency table.
2. Create a bar chart.
3. Create a histogram.
4. Create a frequency polygon.
5. Describe the data set.
Epi Info Example: Frequency Distributions
You are attending a fictitious international conference. Demographic data was collected on the attendees. Use what you know about frequency distributions to summarize the data. First, create a table and a bar chart of the categorical variable, Occupation. Then, create a histogram and a frequency polygon for the continuous numerical variable, Weight_kg. The data set is called Frequency_Dist and is found in the Bios_Workbook_Examples.mdb database.
Frequency Table
1. READ the data set.
Open Epi Info and choose Analyze Data.
Select READ under Data Analysis Commands.
Open Frequency_Dist in the database, Bios_Workbook_Examples.mdb.
2. Create a frequency table.
Select the FREQUENCIES command.
In the Frequency dropdown box, highlight the variable that you want to examine. For this example, highlight Occupation.
Click OK.
3. Describe the data.
You should see a frequency table for Occupation on your screen.
This table provides information on the variable Occupation by presenting frequencies and relative frequencies.
Bar Chart
1. Make a frequency bar chart in Epi Info.
Choose GRAPH under Statistics.
In the Graph Type dropdown box, choose Bar (default).
In the box labeled 1st Title | 2nd Title, type Occupation of Participants. This is the title of your chart.
Under X Axis, choose Occupation as the Main Variable.
Under Y Axis, Show Value of Count (default).
Click OK.
2. Describe the data.
Epi Info will display a bar chart of the number of participants in each occupation.
Notice that the graph represents the exact numbers listed in the table created previously.
You can make a bar chart of the percentage of participants in each occupation by choosing Show Value of Count % under Y Axis.
Histogram
1. Make a histogram in Epi Info.
Choose GRAPH under Statistics.
Under Graph Type, choose Histogram.
Create a title for your graph.
Choose Weight_kg as the main variable and Show Value of Count.
Notice that when you select Histogram as the Graph Type, you are given the option to create intervals. This allows you to group the variable Weight_kg without creating a new variable. Using the Intervals option makes the data easier to view. If you create a FREQUENCIES table you can see that there are nearly 50 different weights recorded. It may not be useful to have each one listed separately.
To create intervals, look at the column marked X Axis. Type 5 in the first space under Interval. Type 45 in the space under First Value.
Click OK.
2. Describe the data.
Now the graph you see will present the weight of participants in 5 kg intervals.
Epi Info Practice: Frequency Distributions
Use the data set from the fictitious conference (Frequency_Dist) once again to create frequency distributions for Height_cm and PreferredLanguage in Epi Info.
1. Create a frequency table of PreferredLanguage in Epi Info.
2. Make a frequency bar chart of PreferredLanguage in Epi Info.
3. Make a histogram of Height in Epi Info.
Review each of these displays and describe the data set.
Practice Space
4. Describe the data set using the frequency charts and graphs that you have created.
Excel Example: Frequency Distributions
Now use Excel to create a frequency polygon for the continuous numerical variable, Weight_kg. The data set is called Frequency_Dist and is found in the Bios_Workbook_Examples.mdb database.
1. Create a frequency polygon in Excel.
a. Open Excel and import the data set.
From the toolbar, select Data. Highlight Import External Data. Choose Import Data. Locate Frequency_Dist in the Bios_Workbook_Examples.mdb database. Click Open.
The data set should appear as an Excel spreadsheet.
b. Create a frequency table for Weight_kg.
Copy the variable Weight_kg by highlighting the column. Press Ctrl+C to copy. Choose a blank cell on the spreadsheet and paste the variable by pressing Ctrl+V.
In the cell next to the variable heading, type Interval. Complete the column by entering the intervals that you have chosen for the data. In this case, create intervals of 5, beginning with 45-49 and continuing until 100-104. You should anchor the intervals by including ≥105. The first and last intervals should have a frequency of zero.
The next column will be titled Bin. Bin is a word used by Excel to define interval limits. In this column, we tell Excel how to read the intervals that we have created. The first number in the bin array, n, tells Excel to count all observations less than or equal to n. The second number, p, tells Excel to count all observations greater than n and less than or equal to p (for whole-number weights, from n+1 through p). This continues through the final number in the bin array. Observations greater than the final bin number are not counted in the last bin; the FREQUENCY function returns them as one extra count, which appears only if the highlighted output range is one cell longer than the Bin column.
Create the bin by typing in the highest number that should be included in each interval. Because the last interval is expected to have a frequency of zero, a final bin number of 105 is sufficient here.
Weight (kg)   Intervals   BIN   Frequency
73            ≥105        105
67
75
87
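The way bin limits turn into counts can be sketched outside Excel. The function below follows the counting rule of Excel's FREQUENCY function: each bin counts values greater than the previous limit and less than or equal to its own limit, and one extra count is returned for values above the last limit. The weights are hypothetical (the first four match the values visible in the spreadsheet fragment above), the bin limits run from 49 to 104 in steps of 5, and the function name is made up for this illustration.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin using upper limits, as Excel's FREQUENCY does.

    bins must be sorted in ascending order. The result has one more
    element than bins: the last element counts values greater than
    the final bin limit.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        # Index of the first bin limit that is >= value;
        # len(bins) if the value exceeds every limit.
        counts[bisect_left(bins, value)] += 1
    return counts


# Upper limits of the intervals 45-49, 50-54, ..., 100-104.
bin_limits = list(range(49, 105, 5))

# Hypothetical weights in kg.
weights = [73, 67, 75, 87, 58, 92, 81, 66]

counts = frequency(weights, bin_limits)
for limit, count in zip(bin_limits, counts):
    print(f"<= {limit}: {count}")
print(f">  {bin_limits[-1]}: {counts[-1]}")
```

With whole-number weights, the extra final count corresponds to the ≥105 interval, so no separate bin limit is needed for it in this sketch.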
Your final column will be called Frequency. We will let Excel calculate the frequencies for us.
Highlight the Frequency column by clicking on the first cell under the heading and dragging the mouse until the shaded area equals the length of the Bin column. Do not include the column label (Frequency) when highlighting.
Under Insert in the toolbar, choose Function. Select the function FREQUENCY. You may have to search for it by typing the word frequency at the prompt.
Click OK.
The function box will appear.
Click on the chart icon to the right of the box labeled Data_array. Highlight all the values for the variable Weight_kg.
Click on the chart icon again to return to the function box.
Click on the chart icon to the right of the box labeled Bins_array. Highlight all the values in the Bin column. Click on the chart icon again to return to the function box.
Press Control and Shift together and hit Enter while continuing to hold the other two keys down. (DO NOT CLICK OK!)
The number of observations included in each interval will be shown in the Frequency column. You now have a frequency table. Note that there is a frequency of zero at the high end and at the low end of the weight intervals. You will need this in order to create a frequency polygon correctly.
c. Create a frequency polygon.
Using the frequency table that you just made, highlight all the values in the Frequency column.
Under Insert in the toolbar, select Chart.
Choose Chart Type: Line. The first line graph in the second row is preferred because it shows the midpoints in the graph.
Click Next.
A frequency polygon will appear.
To correctly label the polygon, choose the Series tab.
Click the chart icon next to the box labeled Category (X) axis labels.
Highlight the values in the column, Intervals.
Your chart should now be labeled with the weight intervals.
Click Next.
Choose Title to give your chart a title and label the X axis.
Click Finish.
2. Describe the data.
[Frequency polygon: Weight of Conference Participants. X axis: Weight in kg, in 5 kg intervals ending at ≥105. Y axis: frequency (0 to 9).]
This distribution is unimodal because one peak is higher than the rest. The majority of participants' weights fall to the left of the peak. Most participants weigh less than 84 kg.
Excel Practice: Frequency Distributions
Use the data set from the fictitious conference (Frequency_Dist) to create a frequency polygon for Height_cm in Excel.
1. Create a frequency polygon of Height in Excel.
Use your graph to answer the following questions.
Practice Space
2. Describe the data set using the frequency polygon.
3. How is this similar to the histogram that you created in Epi Info?
Related Concepts: Central Location and Dispersion
Central Location and Dispersion
Measures of central location and dispersion are generally referred to as descriptive statistics because they describe the distribution of the data set.
A frequency distribution provides a picture of the number of times that each value of a variable occurs, but by itself it does not quantify the center or the spread of the data. In order to gain a clearer picture of how data is distributed, we will calculate:
Measures of central tendency: mean, median, mode
Measures of dispersion: range, variance, standard deviation, and standard error
Through these measures, the data begins to take shape. When they are combined with the frequency distribution, we can visualize the distribution of the data. We obtain the number and height of the peaks in the distribution from the frequency. Measures of dispersion allow us to obtain an idea of the width, or the spread, of the distribution of the data.
Data can be either symmetric or skewed. If the distribution can be divided at its center into two halves that are nearly mirror images of each other, we can say that the data is symmetric. If one tail of a unimodal distribution is longer than the other tail, then the data is skewed, meaning that the data is not spread evenly around the center. Data can be either right skewed or left skewed. If data is skewed to the right, it will rise quickly to a peak and have a long tail on the right. The opposite is true for data that is skewed to the left.
Measures of Central Tendency
Mean
The mean is simply the arithmetic average of the data and is calculated by taking the sum of all values in the data set and dividing that total by the number of values in the data set. The mean is the most commonly used measure of central tendency.
\[ \bar{x} = \frac{\sum x}{n} \]
Median
The median is the 50th percentile of the values in a data set and represents the literal middle of the data. The median is found by arranging all values in the data set in numerical order and then choosing the middle value. If the number of values in a data set is even, take the mean of the two middle numbers to find the median.
Mode
The mode represents the value that is found most frequently in a set of numbers. Note that it is possible to have more than one mode. In the following set of numbers, {8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7}, the mode is both 8 and 6, since each is included in the data set three times. This data set is referred to as bimodal because it has two modes. It is also possible not to have a mode in a set of numbers. In the following set of numbers, {5, 4, 9, 7, 6, 3, 8}, there is no number which occurs more frequently than any other. Therefore, there is no mode.
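The two mode examples above can be checked with a short function. This is a sketch under one stated convention: when every value occurs exactly once, it reports no mode (an empty list), as the text does. The function name is made up for this illustration.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the most frequent values.

    Returns an empty list when the data set is empty or when every
    value occurs only once, in which case there is no mode.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


bimodal_set = [8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7]
no_mode_set = [5, 4, 9, 7, 6, 3, 8]

print(modes(bimodal_set))  # two modes: 6 and 8
print(modes(no_mode_set))  # no mode
```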
Overview
Measures of central tendency are used to describe the data in the sample by giving an idea of the center and the distribution of the data.
There are three common measures of central tendency: mean, median and mode.
Formula: For instance, the arithmetic mean is calculated as follows:
\[ \bar{x} = \frac{\sum x}{n} \]
Comparison of mean, median, and mode
When you are told to average the data, it is generally expected that you will take the mean. Technically, however, the average could refer to the mean, the median, or the mode of the data. The mean is able to give us the most information about the data set as a whole, especially when combined with the standard deviation. Therefore, we prefer to use the mean when we can.
There are certain advantages to the median. The median is resistant to skewing, in which an outlier pulls the mean of the data to the left or to the right. Unlike the mean, it is not affected by extreme values, and it is more representative of the center of the data when the data is asymmetrical.
Let's consider skewed data. Look at the graph of the population distribution by state in the United States.
[Bar chart: Population of the United States by State. X axis: State (the 50 states and the District of Columbia, ordered from California, the most populous, down to Wyoming). Y axis: Population (0 to 40,000,000). The mean and the median are marked on the chart.]
The states appearing on the left side of the chart have a much larger population than other states. Because of this, we expect the mean to be higher in value than the median. The calculated mean in this sample is 5,811,968.706, which is marked on the graph. The median is 4,173,405, also marked on the graph. The mean in this example is greater than the median. A general rule to follow is that if the data is skewed either to the left or to the right, the median represents the data better than the mean. If a sample is normally distributed, the mean and median will be nearly the same. With symmetrical, unimodal data, the mode will be similar as well.
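The effect of a long right tail on the mean can be illustrated with a small made-up sample. The values below are hypothetical (they are not the state populations); the point is only that one very large value pulls the mean upward while the median barely moves.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is very large.
sample = [1, 1, 2, 2, 3, 4, 5, 6, 12, 36]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)

# The same sample without its largest value.
trimmed = sample[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(f"With the extreme value:    mean = {mean_value}, median = {median_value}")
print(f"Without the extreme value: mean = {trimmed_mean}, median = {trimmed_median}")
```

Removing the single extreme value changes the mean far more than the median, which is why the median is usually preferred for skewed data.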
When the sample size is small, the mode may represent the data most accurately. It is possible that in bimodal data, the modes will be a more accurate description as well. The mode is also frequently used to describe qualitative data. For example, you might report a modal diagnosis, that is, the diagnosis that was seen most frequently over a given period of time.
Step-by-Step Example: Mean, Median, Mode
The following are ages of patients seen by the doctor for a broken bone in the past month:
15 17 20 14 16 15 17 22 18 13 15 14 16 18 20
Use the data to answer the following questions:
What is the mean age of the patients?
What is the median age of the patients?
What is the modal age of the patients?
Which measure is the most representative of the sample?
1. Find the mean, \bar{x}, of the sample.
\[ \bar{x} = \frac{\sum x}{n} = \frac{15 + 17 + 20 + 14 + 16 + 15 + 17 + 22 + 18 + 13 + 15 + 14 + 16 + 18 + 20}{15} = \frac{250}{15} \approx 16.7 \]
2. Find the median of the sample.
First line the numbers up in numerical order:
13 14 14 15 15 15 16 16 17 17 18 18 20 20 22
Find the middle number:
13 14 14 15 15 15 16 [16] 17 17 18 18 20 20 22
There are 7 numbers on either side of the bracketed value, thus 16 is the median.
3. Find the mode of the sample.
13 14 14 15 15 15 16 16 17 17 18 18 20 20 22
The number that appears most, at three times, in this data set is 15. Therefore, 15 is the mode.
4. Which statistic is most representative of the center of the data set?
In this case, the mean and the median are nearly equal. Therefore, we can treat the distribution as roughly symmetric, and the mean represents the center of the data. If the mean and the median differ noticeably, we can assume that the data is skewed, and the median will generally be more appropriate.
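The hand calculations in this example can be confirmed with Python's standard statistics module. This is simply a cross-check of the worked steps; the variable names are choices made for this illustration, and statistics.multimode requires Python 3.8 or later.

```python
import statistics

# Ages of the 15 patients seen for a broken bone, from the example above.
ages = [15, 17, 20, 14, 16, 15, 17, 22, 18, 13, 15, 14, 16, 18, 20]

mean_age = statistics.mean(ages)         # sum of the ages divided by 15
median_age = statistics.median(ages)     # middle value of the sorted ages
modal_ages = statistics.multimode(ages)  # list of the most frequent value(s)

print(f"n = {len(ages)}, sum = {sum(ages)}")
print(f"mean = {mean_age:.1f}")
print(f"median = {median_age}")
print(f"mode(s) = {modal_ages}")
```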
Practice: Mean, Median, Mode
In order to determine if there is a relationship between age and the number of visits to the doctor, you decide to count the number of doctor visits that individuals make over the course of a year. Below is the data that you have collected:
Individual   Age   Visits
1            45    15
2            60    8
3            52    22
4            46    9
5            23    2
6            52    15
7            37    3
8            33    13
Describe the average age of your sample and the average number of doctor visits made by an individual using the mean, median, and mode.
Practice Space
1. Find the mean, \bar{x}.
\[ \bar{x} = \frac{\sum x}{n} \]
2. Find the median.
Step 3. Find the mode.
Practice Space:

Step 4. Which statistic is most representative of the center of the data set and why?
Practice Space:
Epi Info Example: Mean, Median, Mode
Using the same patient age data from the step-by-step example above, we can find the mean, median, and mode in a few simple steps using Epi Info.

Step 1. Use Epi Info to determine descriptive statistics.
a. READ the data set.
   Open Epi Info and choose Analyze Data.
   Select READ in Data Analysis Commands.
   Highlight Central_Tendency from the Data Source Bios_Workbook_Examples.
   Click OK.
b. Find the MEANS of the data.
   Select MEANS from the Commands column under Statistics.
   Choose Age from the drop-down box under Means of.
   Click OK.
Step 2. Identify the mean, median, and mode of the data.
This is the output that you should see:
[Epi Info MEANS output for Age; screenshot not reproduced]
The output gives you the mean, the median, and the mode. Epi Info reports the mean to be 16.7, the median to be 16.0, and the mode to be 15.0. This does not differ from the hand calculations that we performed previously.

Step 3. Interpret the results.
As we determined earlier, the mean and the median are nearly equal. This suggests that the data are roughly symmetric, as they would be under a normal distribution, so the mean represents the center of the data well. If the mean and the median are different, we can assume that the data are skewed and the median will generally be more appropriate.
Epi Info Practice: Mean, Median, Mode
You are weighing babies from 9 AM to 11 AM at an under-five clinic in the village. Your results are as follows:

Age (months)   Length (cm)   Weight (kg)
21             77            9.8
34             87            11.5
23             84            10.8
30             92            14.0
27             85            12.0
24             82            10.8
31             87            11.6
26             85            11.8
22             85            12.4
32             86            12.0

Use Epi Info to find the mean, median, and mode. Then, answer the questions that follow. The data set you are working from is called BabyWeighing. Remember to open the data set in Epi Info by using the READ command.
Step 1. Identify the mean, median, and mode of the data.
Practice Space:
Length:              Weight:
Mean ______          Mean ______
Median _____         Median _____
Mode ______          Mode ______

Step 2. What is the average length and weight of babies that came into the clinic on this morning?
Practice Space:

Step 3. What can you determine about the distribution of the data based on your results?
Practice Space:
Related Concepts
Measures of Dispersion
Normal Distribution
Measures of Dispersion

In the previous section, we discussed methods of describing the center of the data. Now we want to examine ways to describe the spread of the data, or how far each data point is from the center.

Range: The range of the data is the difference between the largest observation (maximum value) and the smallest observation (minimum value) in a set of data.

range = maximum - minimum

Interquartile Range (IQR): The interquartile range is the difference between the 75th percentile (3rd quartile, Q3) and the 25th percentile (1st quartile, Q1) in a set of data. This measurement describes the spread of the middle 50 percent of the observations and is, therefore, less likely to be influenced by outliers or extreme values. In the sorted data, Q1 is the value at position (n+1)/4 and Q3 is the value at position 3(n+1)/4.

$$\text{IQR} = Q_3 - Q_1, \qquad \text{position of } Q_1 = \frac{n+1}{4}, \qquad \text{position of } Q_3 = \frac{3(n+1)}{4}$$
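Overview

Measures of dispersion describe variability of data in a sample by describing the spread of the data.

Formulas:

Range = maximum - minimum

Interquartile Range = $Q_3 - Q_1$, with $Q_1$ at position $\frac{n+1}{4}$ and $Q_3$ at position $\frac{3(n+1)}{4}$

Variance: $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \quad \text{OR} \quad s^2 = \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}$$

Standard deviation: $$s = \sqrt{s^2}$$

Standard error: $$SE = \frac{s}{\sqrt{n}}$$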
Variance ($s^2$): The variance represents the amount of spread or variability around the mean of a set of data. Because the variance is in units squared, we find the standard deviation to describe our data in the proper units. The symbol $s^2$ is used when we are referring to the variance of a sample and the symbol $\sigma^2$ (pronounced sigma squared) when we are referring to the variance of a population.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \quad \text{OR} \quad s^2 = \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)}$$

Standard Deviation ($s$): The standard deviation of a set of data is the square root of the variance. It describes the average distance of all observations from the mean of the sample and is used as a measure of variability to describe the spread of the data. A large standard deviation represents a wide spread because the observations are far from the mean. When we refer to the standard deviation of a population, we use the symbol $\sigma$ (sigma).

$$s = \sqrt{s^2}$$

Standard Error (SE): The standard error is the standard deviation of the sampling distribution of the means, rather than of the observations themselves. The smaller the standard error, the closer any given sample mean is likely to be to the true population mean.

$$SE = \frac{s}{\sqrt{n}}$$
Step-by-Step Example: Measures of Dispersion
Using the data below, follow the instructions to identify the measures of dispersion for Age.

Individual   Age   Visits
1            45    15
2            60    8
3            52    22
4            46    9
5            23    2
6            52    15
7            37    3
8            33    13
Minimum, maximum, and range

Step 1. Identify the minimum value of Age.
The minimum value is the lowest value in the sample. In this case, it is 23.

Step 2. Identify the maximum value of Age.
The maximum value is the highest value in the sample. In this case, it is 60.

Step 3. Determine the range of Age.
max - min = range
60 - 23 = 37
37 is the range of the sample.

Step 4. State your conclusions.
The observations in Age cover a range of 37 years.
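Interquartile Range

Step 1. Arrange observations of the variable Age in order of increasing value.
1) 23   2) 33   3) 37   4) 45   5) 46   6) 52   7) 52   8) 60

Step 2. Find the position of the 1st (Q1) and 3rd (Q3) quartiles.

$$\text{position of } Q_1 = \frac{n+1}{4}, \qquad \text{position of } Q_3 = \frac{3(n+1)}{4}$$

$$\text{position of } Q_1 = \frac{8+1}{4} = 2.25, \qquad \text{position of } Q_3 = \frac{3(8+1)}{4} = 6.75$$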
Step 3. Locate each number indicated in the data set.
Q1, with a position of 2.25, is one-fourth of the way between the 2nd and 3rd observations in the set. The 2nd value is 33 and the 3rd is 37, so

$$Q_1 = 33 + \frac{1}{4}(37 - 33) = 33 + 1 = 34$$

Q3, with a position of 6.75, is three-fourths of the way between the 6th and 7th observations in the set. The 6th value is 52 and the 7th value is also 52. Therefore, Q3 = 52.

Step 4. Find the difference between Q1 and Q3 to determine the interquartile range.
Q3 - Q1 = IQR
Q1 = 34, Q3 = 52
52 - 34 = 18

Step 5. State your conclusions.
The middle 50 percent of the data has a range of 18. This means that the middle half of all the observations in Age is spread across 18 years.
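The range and quartile steps for Age can be sketched in code as follows. The helper `value_at_position` is a hypothetical name introduced here for illustration; it applies the same rule as the hand calculation (find the position, then move the fractional part of the way toward the next observation). It does not guard against positions beyond the last observation.

```python
import math

def value_at_position(sorted_values, position):
    """Return the value at a 1-based, possibly fractional, position by
    interpolating linearly between the two neighbouring observations."""
    lower = math.floor(position)
    fraction = position - lower
    low_value = sorted_values[lower - 1]
    if fraction == 0:
        return low_value
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)

ages = sorted([45, 60, 52, 46, 23, 52, 37, 33])
n = len(ages)

data_range = max(ages) - min(ages)             # 60 - 23
q1 = value_at_position(ages, (n + 1) / 4)      # position 2.25
q3 = value_at_position(ages, 3 * (n + 1) / 4)  # position 6.75
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```

Statistical packages use several different quartile conventions, so software output may differ slightly from this position rule. In Python 3.8 or later, `statistics.quantiles(ages, n=4, method="exclusive")` should follow the same (n+1) positions.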
Variance, standard deviation, and standard error

Step 1. Find the mean of the data set.
1) 23   2) 33   3) 37   4) 45   5) 46   6) 52   7) 52   8) 60

$$\bar{x} = \frac{23+33+37+45+46+52+52+60}{8} = \frac{348}{8} = 43.5$$
Step 2. Calculate the variance using the formula below.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

$$s^2 = \frac{1}{8-1}\left[(23-43.5)^2 + (33-43.5)^2 + (37-43.5)^2 + (45-43.5)^2 + (46-43.5)^2 + 2(52-43.5)^2 + (60-43.5)^2\right]$$

$$s^2 = \frac{1}{7}\left[420.25 + 110.25 + 42.25 + 2.25 + 6.25 + 2(72.25) + 272.25\right]$$

$$s^2 = \frac{1}{7}(998)$$

$$s^2 = 142.57$$

Step 3. Calculate the standard deviation.

$$s = \sqrt{s^2} = \sqrt{142.57} = 11.94$$

Step 4. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}} = \frac{11.94}{\sqrt{8}} = 4.22$$

Step 5. State your conclusions.
The observations are an average of 11.94 years away from the mean. If we were to take many samples of this size from the same population, the sample means would typically fall about 4.22 years from the actual population mean.
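As a check on the arithmetic, the sketch below computes the variance of Age with both the definitional formula and the computational ("OR") formula, then derives the standard deviation and standard error. The variable names are illustrative only.

```python
import math

ages = [23, 33, 37, 45, 46, 52, 52, 60]
n = len(ages)
x_bar = sum(ages) / n                             # 348 / 8 = 43.5

# Definitional formula: squared deviations from the mean, divided by n - 1
sum_sq_dev = sum((x - x_bar) ** 2 for x in ages)  # 998.0
variance_def = sum_sq_dev / (n - 1)

# Computational formula: [n * sum(x^2) - (sum x)^2] / [n * (n - 1)]
sum_x = sum(ages)
sum_x2 = sum(x * x for x in ages)
variance_comp = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

std_dev = math.sqrt(variance_def)
std_err = std_dev / math.sqrt(n)

print(f"s^2 = {variance_def:.2f} (definitional), {variance_comp:.2f} (computational)")
print(f"s   = {std_dev:.2f}")
print(f"SE  = {std_err:.2f}")
```

Both formulas give the same variance; the computational form avoids calculating each deviation, which is convenient when working by hand from running totals.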
Practice: Measures of Dispersion
Use the same data set to describe the dispersion of the observations of the variable Visits.

Individual   Age   Visits
1            45    15
2            60    8
3            52    22
4            46    9
5            23    2
6            52    15
7            37    3
8            33    13

Minimum, maximum, and range

Step 1. Identify the minimum value of Visits.
Practice Space:

Step 2. Identify the maximum value of Visits.
Practice Space:

Step 3. Determine the range of Visits.
max - min = range
Practice Space:

Step 4. State your conclusions.
Practice Space:
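Interquartile Range

Step 1. Arrange observations of the variable Visits in order of increasing value.
Practice Space:

Step 2. Find the position of the 1st (Q1) and 3rd (Q3) quartiles.

$$\text{position of } Q_1 = \frac{n+1}{4}, \qquad \text{position of } Q_3 = \frac{3(n+1)}{4}$$

Practice Space:

Step 3. Locate each number indicated in the data set.
Practice Space:

Step 4. Find the difference between Q1 and Q3 to determine the interquartile range.
Q3 - Q1 = IQR
Practice Space:

Step 5. State your conclusions.
Practice Space: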
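Variance, standard deviation, and standard error

Step 1. Find the mean of the variable Visits.
Practice Space:

Step 2. Calculate the variance using the formula below.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

Practice Space:

Step 3. Calculate the standard deviation.

$$s = \sqrt{s^2}$$

Practice Space:

Step 4. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}}$$

Practice Space:

Step 5. State your conclusions.
Practice Space: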
Epi Info Example: Measures of Dispersion
Use the table below (data set BabyWeighing) to find measures of dispersion for the variable Age in Epi Info. First find the maximum, minimum, range, and interquartile range. Then calculate the variance, the standard deviation, and the standard error.

Age (months)   Length (cm)   Weight (kg)
21             77            9.8
34             87            11.5
23             84            10.8
30             92            14.0
27             85            12.0
24             82            10.8
31             87            11.6
26             85            11.8
22             85            12.4
32             86            12.0

Step 1. READ the data set in Epi Info.
Open Epi Info and choose Analyze Data.
Select READ and open the database, Bios_Workbook_Examples. Choose the data set BabyWeighing.
Click OK.

Step 2. Find the MEANS of the data set.
Select MEANS under the Statistics heading.
In the drop-down menu for Means Of, choose Age_in_months.
Click OK.
Step 3. Use the output to determine the range and the interquartile range.
The output provides you with the maximum and the minimum in the data. Find the difference to determine the range.

Range = maximum - minimum
Range = 34 - 21 = 13

The output also provides the 25th percentile, equal to Q1, and the 75th percentile, equal to Q3, so that we can determine the interquartile range.

IQR = Q3 - Q1
IQR = 31 - 23 = 8

Step 4. Use the output to identify the variance and standard deviation of the variable.
Variance = 20.67
Standard Deviation = 4.55

If we want to calculate the standard error, we simply divide the standard deviation by the square root of the number of observations:

$$SE = \frac{4.5461}{\sqrt{10}} = 1.44$$
Step 5. Describe the variable in terms of dispersion.
The range of the variable Age_in_months is 13 months. The middle half of the data spans 8 months. The average distance of each observation from the mean of the data is 4.55 months. If we were to take many samples of this size from the same population, we would find that the sample means typically fall about 1.44 months from the actual population mean.
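For readers without Epi Info at hand, the figures reported for Age can be reproduced with the standard-library `statistics` module. This is an illustrative sketch rather than a replacement for the Epi Info steps; quartiles are left out because different programs use slightly different quartile conventions.

```python
import math
from statistics import stdev, variance

# Age in months for the ten babies in the BabyWeighing table
age_months = [21, 34, 23, 30, 27, 24, 31, 26, 22, 32]

age_range = max(age_months) - min(age_months)
s2 = variance(age_months)            # sample variance (n - 1 denominator)
s = stdev(age_months)                # sample standard deviation
se = s / math.sqrt(len(age_months))  # standard error of the mean

print(f"range = {age_range}, variance = {s2:.2f}, SD = {s:.2f}, SE = {se:.2f}")
```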
Epi Info Practice: Measures of Dispersion
Use the same data set, BabyWeighing, to practice describing data in terms of dispersion with the help of Epi Info. Determine the range and interquartile range and identify the variance, standard deviation, and the standard error of the variable Length.

Find the MEANS of the data set in Epi Info.
Use the output to answer the following questions.

Step 1. Determine the range and the interquartile range.
Practice Space:
Range =
IQR =

Step 2. Identify the variance and standard deviation of the variable.
Practice Space:
$s$ = ______
$s^2$ = ______

Step 3. Describe the variable in terms of dispersion.
Practice Space:
Related Concepts
Normal Distribution
Probability and the Normal Distribution

Up to this point, we have focused on descriptive statistics. We have simply been organizing and summarizing data that has been collected. We also want to explore some methods for drawing conclusions about populations based solely on data that we have for a sample of that population. Because we can never be certain that our conclusions based on this sample accurately represent the target population, we refer to this as inferential statistics. Inferential statistics is based on probability theory, or the science of uncertainty. The following sections describe how probability theory allows us to make inferences about a population based on data obtained from a sample of that population.
Probability Distribution

Probability is an indicator of the likelihood that an event or condition will occur. Some describe it as the long-run relative frequency of the event in repeated trials under similar conditions. It reflects the proportion of the population with the condition or event. For example, if 40% of workers in a factory are female, the probability that a randomly selected worker will be a female is 40%. Stated another way, if we randomly select n workers, the expected number of females in the sample is n × 40%. Likewise, the expected number of males is n × (100% - 40%), or n × 60%.

Probability can also be used to consider continuous variables (not just conditions or events as noted above). It can indicate the likelihood of a value in a particular range. For example, if 5% of men at the factory have a height over 180 cm, the probability that a randomly selected man will have a height over 180 cm is 5%.

Probability distributions represent the probability of the different outcomes (e.g., male, female) for a sample selection. The relationship between the values of a variable and the probabilities of their occurrence can be summarized in a probability distribution.
Overview

A probability distribution is a distribution of data based on the likelihood that an event or indicator will occur in a sample of the population.

Knowledge of the probability distribution of a variable allows us to draw conclusions about a population based on data taken from a sample of that population.

If we select a single worker from this factory, the probability distribution for the possible outcomes for gender is simple.

Possible outcome     Probability
Male                 0.60
Female               0.40

If we select three workers, then the probability distribution becomes more complicated.

Possible outcomes    Probability
All male             0.216 = (0.60 × 0.60 × 0.60)
2 male, 1 female     0.432 = 3 × (0.60 × 0.60 × 0.40)
2 female, 1 male     0.288 = 3 × (0.40 × 0.40 × 0.60)
All female           0.064 = (0.40 × 0.40 × 0.40)

The factor of 3 in the mixed outcomes counts the three possible positions (first, second, or third selection) of the one worker whose gender differs from the other two.
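The three-worker probabilities follow the binomial pattern: the number of ways to arrange the outcome multiplied by the probability of any one arrangement. The sketch below assumes that the three selections are independent with a constant 40% chance of selecting a female (reasonable for a large factory); the function name `prob_females` is illustrative only, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

p_female = 0.40
n_workers = 3

def prob_females(k, n=n_workers, p=p_female):
    """Probability of exactly k females among n independently selected workers."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

distribution = {k: prob_females(k) for k in range(n_workers + 1)}

for k, prob in distribution.items():
    print(f"{k} female, {n_workers - k} male: {prob:.3f}")
```

The four probabilities sum to 1, as they must for a complete probability distribution.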
There are several model or theoretical probability distributions that will allow us to determine the probability of a given value for a random variable even if we do not have (or know) the full probability distribution for that variable. These probability distributions are given or calculated by mathematical formulae called probability functions. We can apply the model to create a probability density curve where the height of the curve reflects the frequency of the individual values and the area in an interval under the curve reflects the proportion of a population in that interval. This is also a probability distribution.

Examples of probability and other distributions include the normal, binomial, Poisson, chi-square, F, and t distributions. For the sake of simplicity, the only distribution we will cover in this workbook is the normal distribution.
Related Concepts
Normal Distribution
Normal Distribution

The normal distribution is the most famous and important of the theoretical probability distributions for two main reasons. First, for many variables we encounter in the health field (e.g., height, blood pressure, hemoglobin level), it is a good description of the distribution of the variable. Second, and more importantly, the normal distribution has a central role in statistical analysis as it is used as the probability distribution of the sample means. Calculations based on the normal distribution are used to derive confidence intervals and determine p-values for quantitative data, proportions, and rates.

Characteristics of a normal distribution:

• It is specified by two parameters: the population mean and the standard deviation.
• It is symmetrical around the mean, bell-shaped, and unimodal. This is why the normal curve is frequently referred to as the bell curve.
• The mean, median, and mode are all in the middle of the curve.
• The total area under the curve above the x-axis is one square unit, with 50% of the area to the right of the mean and 50% to the left of the mean.

According to the Empirical Rule:

• The area bounded by one standard deviation to the right and one standard deviation to the left of the mean represents approximately 68% of the values.
• The area bounded by two standard deviations to the right and two to the left represents approximately 95% of the values.
• 99.7% of the values will be within three standard deviations of the mean.

This is demonstrated in the graph of male heights that follows.
Overview

The normal distribution is a bell-shaped curve with both the mean and the median at the center of the curve.

The standard normal distribution is a distribution of data with a mean of zero and a standard deviation of one. It allows different populations to be compared to each other.

Formula: The formula below is used to calculate the standard score, or the z score, when comparing normally distributed populations.

$$z = \frac{x - \mu}{\sigma}$$
Knowing the mean and standard deviation of a normal distribution allows one to determine the following values:

• The proportion of individuals who fall into any range of values
• The percentile at which a given value falls
• The value which corresponds to a given percentile

Below is a frequency distribution of the height of men in the US population, characterized by a normal distribution with a mean of 171.5 cm and a standard deviation of 6.5 cm.

[Figure: normal curve of male height, centered at $\mu$ = 171.5 cm]
Given that the mean height of the men in the US is 171.5 cm ($\mu$ = 171.5 cm) and the standard deviation is 6.5 cm ($\sigma$ = 6.5 cm), and using our knowledge of the normal curve, we know the following information:

• 68.3% of men are between 165 and 178 cm ($\mu \pm 1\sigma$ = 171.5 ± 6.5)
• 95.5% of men are between 158.5 and 184.5 cm ($\mu \pm 2\sigma$ = 171.5 ± 2 × 6.5)
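The Empirical Rule percentages can be verified numerically. The sketch below assumes Python 3.8 or later and uses `statistics.NormalDist` as a stand-in for the height distribution described above; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

# Male height modelled as normal with mean 171.5 cm and SD 6.5 cm
heights = NormalDist(mu=171.5, sigma=6.5)

def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    lower = heights.mean - k * heights.stdev
    upper = heights.mean + k * heights.stdev
    return heights.cdf(upper) - heights.cdf(lower)

for k in (1, 2, 3):
    lo = heights.mean - k * heights.stdev
    hi = heights.mean + k * heights.stdev
    print(f"within {k} SD ({lo:.1f} to {hi:.1f} cm): {share_within(k):.1%}")
```

The same proportions hold for any normal distribution, whatever its mean and standard deviation.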
What if we want to know specific information such as:

• What proportion of men are over 180 cm?
• What height value is at the 10th percentile?

Statisticians have devised a method to transform all normal distributions so that they use the same scale. This is known as the standard normal distribution. The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. A normal distribution can be compared with other normal distributions by converting it to a standard normal distribution using the formula shown below. The standard normal distribution specifies how far an individual value is from the mean in units of the standard deviation, which allows us to calculate a standard score. The standard score is a way of expressing an individual value in terms of standard deviation units. The standard score, referred to as the z score, is calculated as (observed value - mean) divided by the standard deviation. The formula is below:

$$z = \frac{x - \mu}{\sigma}$$

The z score will also be referred to as a test statistic. Each distribution has a corresponding test statistic. The z score corresponds with the standard normal distribution.
Example: Using the Standard Normal Distribution
Given a normal distribution of male heights with $\mu$ = 171.5 cm and $\sigma$ = 6.5 cm, what is the proportion of men taller than 180 cm?

$$z = \frac{x - \mu}{\sigma} = \frac{180 - 171.5}{6.5}$$

$$z = \frac{8.5}{6.5} = 1.31$$

Now that we know the z score, we must find the area of the standard normal curve above 1.31.

In order to find the area of the curve that is represented by the z score, 1.31, we must refer to the standard normal z distribution located in Appendix 2.

On the Standard Normal z Table, locate the z score 1.31. Under the column labeled z, find the value 1.3. The row labeled z will provide you with the hundredths place of your z score, so follow it over until 0.01. If you place one finger on 1.3 and one finger on 0.01 and follow those paths until your two fingers meet, you find the value 0.9049. Use the excerpt from the Standard Normal z Table on the following page to help you locate the z score.

[Figure: standard normal curve marking z = 0 and z = 1.31]
This table will give us the area of the curve located to the LEFT of the z score. As you can see by the diagram, we want to find the area of the curve located to the RIGHT of the z score. To find the area to the right of the z score, we subtract 0.9049 from 1.

1 - 0.9049 = 0.0951

Therefore, approximately 9.5% (0.0951 × 100%) of the curve is above 180 cm (or more than 1.31 SD above the mean). We can also say that men whose heights are 180 cm and above are taller than 90.5% of American men. Thus, a height of 180 cm represents approximately the 90th percentile.
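The table lookup can be mirrored in code. This sketch assumes Python 3.8 or later; `NormalDist()` with no arguments is the standard normal distribution, and its `cdf` method returns the area to the left of a z score, just as the z table does. The z score is rounded to two decimals only to match the table.

```python
from statistics import NormalDist

mu, sigma = 171.5, 6.5   # population mean and SD of male height (cm)
x = 180                  # height of interest (cm)

z = (x - mu) / sigma                        # about 1.31
area_left = NormalDist().cdf(round(z, 2))   # table value for z = 1.31
area_right = 1 - area_left                  # proportion taller than 180 cm

print(f"z = {z:.2f}")
print(f"area to the left  = {area_left:.4f}")
print(f"area to the right = {area_right:.4f}")
```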
To practice using the table for the standard normal distribution, answer the following question.

What height value is at the 10th percentile? We manipulate the formula to solve for x rather than z:

$$x = \mu + (z \times \sigma)$$

where:

• $x$ is the observed value
• $\mu$ is the population mean (given)
• $\sigma$ is the population standard deviation (given)
• $z$ comes from the standard normal distribution
To find the answer to this problem, first look up the z score from the table in Appendix 2 which corresponds to the lowest 10% of the area beneath the curve. This area will be on the left-hand side of the curve, so the z score will be negative. Do this by reversing the steps we previously used to find the area.

Locate the area closest to 0.10 in the z table. Then follow the row and column to identify the z score that it is associated with. You should find a z score of -1.28.

$$x = \mu + (z \times \sigma)$$
$$x = 171.5 + (-1.28 \times 6.5)$$
$$x = 171.5 - 8.32 = 163.18$$

The 10th percentile is approximately 163.2 cm. This means that 10% of American men are 163.2 cm or shorter and 90% of American men are taller than 163.2 cm.
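Reversing the table lookup corresponds to the inverse of the cumulative distribution function. As a sketch (assuming Python 3.8 or later), `NormalDist().inv_cdf` returns the z score below which a given proportion of the area lies:

```python
from statistics import NormalDist

mu, sigma = 171.5, 6.5   # population mean and SD of male height (cm)

z_10 = NormalDist().inv_cdf(0.10)   # z score with 10% of the area to its left
x_10 = mu + z_10 * sigma            # convert back to centimetres

print(f"z at the 10th percentile = {z_10:.2f}")
print(f"height at the 10th percentile = {x_10:.1f} cm")
```

Because the software does not round z to two decimals, its answer can differ from a table-based hand calculation in the last digit.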
Practice: Using the Standard Normal Distribution
You have attended an HIV/AIDS training where a pretest and a posttest were given in order to measure knowledge gained. Pretest scores are included in the table below. Use the table to answer the following questions.

Pretest Scores: HIV Knowledge

        Females   Males
Mean    60        40
SD      12        10
N       138       97

1. If a male gets a score of 70, what is his z score?
2. What is the z score for a female with a score of 35?
3. What score for females is equivalent to a male's score of 78?
Related Concepts
Central Limit Theorem
Central Limit Theorem

Not all data is normally distributed. Data that is not normally distributed requires different tests in order to properly analyze and compare it. Fortunately, if we have an adequately large sample size (n > 30), the sampling distribution of the sample mean tends to approach normality and we are able to treat it as normal. This concept is known as the Central Limit Theorem.

Just as we calculated the standard deviation for a distribution of individual values around a mean, we can now calculate a similar measure of variability for a series of samples from the population. This is the Standard Error of the statistic, and it measures the precision of the statistic (mean or proportion) as an estimate of the population mean or population proportion. It indicates the degree to which a sample statistic reflects the true population value.

The standard error is the basis for calculating confidence intervals and conducting hypothesis tests for means and proportions. This allows us to make generalizations about a larger group of individuals based on a subset or sample.

As you know, most epidemiologic studies are carried out with the aim of learning about a characteristic in a target population. It is rarely feasible to study every individual. Therefore, we usually compare exposures or disease within a sample of the population. A major role of statistics is to allow us to generalize results from a sample to the larger group and understand how accurately that generalization reflects the actual population mean (or proportion).
Overview

The sampling distribution of sample statistics (mean or proportion) will look normally distributed for large sample sizes.

Simply, if the sample size is large (typically n > 30), the distribution of sample means or sample proportions approximates a normal distribution.

Formula:

$$SE = \frac{s}{\sqrt{n}}$$
Thus, standard error becomes smaller as n gets bigger, meaning that the larger the sample size, the more probable it is that the sample mean, $\bar{x}$, approaches the population mean, $\mu$.
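A small simulation can make this concrete. The sketch below is illustrative and not part of the workbook: it assumes a deliberately skewed population (exponential, with mean 1 and standard deviation 1), draws many samples of size n, and compares the spread of the sample means with the theoretical standard error, 1/√n. The seed and the number of samples are arbitrary choices.

```python
import random
from statistics import mean, stdev

rng = random.Random(2007)   # fixed seed so the run is repeatable

def sample_means(n, n_samples=2000):
    """Means of n_samples samples of size n from a skewed (exponential) population."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(n_samples)]

for n in (10, 40, 160):
    means = sample_means(n)
    print(f"n = {n:3d}: mean of sample means = {mean(means):.3f}, "
          f"SD of sample means = {stdev(means):.3f}, "
          f"theoretical SE = {1 / n ** 0.5:.3f}")
```

Each fourfold increase in n should roughly halve the spread of the sample means, even though the underlying population is far from normal.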
Related Concepts
Statistical Inference

Standard Deviation vs. Standard Error

Both are measures of variation in a data set.

Standard deviation is a measure of variation of individual observations from the mean in a set of data.

Standard error of the mean measures the standard deviation of the sample means.
Statistical Inference

For individual values we use the z score to tell us how far an individual value is from the mean of the sample. Any sample will have an element of random error, meaning that by chance it may not look exactly like the population from which it was drawn. Inferential statistics allows us to quantify the amount of random error.

The steps for conducting inferential statistical tests are similar for each test:

1. State the null and alternative hypotheses.
2. Determine the decision rule.
3. Conduct the appropriate test.
4. Interpret the results.

1. State the null and alternative hypotheses
Hypotheses are formulated based on proving or disproving the status quo, or what we currently regard to be true. Each time we test a new idea, we are in actuality comparing it to our old idea of what already is known. For example, if we know chloroquine to be an effective malaria drug, then when we test the effectiveness of a new drug such as sulfadoxine-pyrimethamine, we use the old drug, chloroquine, as the baseline. Thus, our expectation is that chloroquine works and there will be no difference found by using the new drug. This becomes the null hypothesis, or $H_0$. The alternate hypothesis ($H_A$), often referred to as the research hypothesis, then represents the chance that a significant difference is found between the new drug and the old drug. As we know, a difference can be either higher or lower, better or worse. If we are testing for any difference, we will use a two-tailed test. If we are testing to see in which direction the difference lies, we use a one-tailed test. Using the same level of significance (alpha value), a two-tailed test is more stringent than a one-tailed test.
2. Determine the decision rule
An alpha value ($\alpha$) determines the level of significance at which you will conduct your test. This value is chosen by the researcher. The most common alpha value, and one which is considered an acceptable level of significance by researchers worldwide, is 0.05, or 5 percent. You will also see an alpha value of 0.10, but anything above that is generally considered to be too lenient to account for differences beyond those which are random or coincidental occurrences.
We can generally determine the results of hypothesis testing in three ways: 1) by comparing a calculated value ($t_{calc}$) to a critical value ($t_{crit}$), 2) by comparing the alpha value to a p-value, and 3) by determining if the value specified in the null hypothesis is contained within the limits of a confidence interval. The calculated value is also referred to as the test statistic and is calculated through the use of descriptive statistics for the sample. A critical value is identified by using the correct table. An alpha value, as previously discussed, is specified by the researcher and will be given. The p-value corresponds to the value of the computed test statistic and can be found in some tables, or determined using a statistical software package.
When the value of the computed test statistic exceeds the critical value (i.e., $t_{calc} > t_{crit}$), we can reject the null hypothesis. When $\alpha > p$, we can also reject the null hypothesis. Lastly, if the value specified in the null hypothesis is not contained within the limits of our confidence interval, we can once again reject the null hypothesis. Note that when we are not able to reject the null, we use the phrase "fail to reject the null." We never accept the null. We only reject it or fail to reject it. By rejecting the null, we conclude that the data support our alternative hypothesis; this is evidence in favor of the alternative, not proof that it is true.
3. Conduct the appropriate test
There are several different test statistics that you must choose from when testing for statistical significance. The test statistic you will use depends on the known parameters of the variable. If a population standard deviation ($\sigma$) is known, then we use the z test. With the exception of tests of proportion or very small populations, we will generally know only the standard deviation of a sample ($s$), in which case we use the t test. Therefore, when talking about statistical tests in general, we are referring to the t distribution. The t distribution looks very similar to the normal z distribution, but the tails on either side of the curve are longer.

Let us now revisit the general formula for the construction of a test statistic:

$$\text{test statistic} = \frac{\text{sample statistic} - \text{hypothesized population parameter}}{\text{standard error of the relevant sample statistic}}$$

For continuous data analyzed using the two-sample t test, the numerator compares the difference between the two sample means, $(\bar{x}_1 - \bar{x}_2)$, referred to as the sample statistic or point estimate here, with the difference that would be expected under a true null hypothesis (i.e., $H_0: \mu_1 - \mu_2 = 0$), referred to as the hypothesized population parameter, which often equals zero. The denominator is made up by the standard error, which serves as our measure of variability.
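The general formula and the decision rule can be sketched together in code. This is a simplified illustration under stated assumptions: it uses the standard normal (z) distribution rather than the t distribution, because the Python standard library has no t distribution, and the input numbers are hypothetical values chosen for the example, not data from this workbook. The function name `z_test` is also illustrative.

```python
from statistics import NormalDist

def z_test(sample_stat, hypothesized_value, standard_error, alpha=0.05):
    """Two-tailed z test: returns (test statistic, critical value, p-value, reject?)."""
    z_calc = (sample_stat - hypothesized_value) / standard_error
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z_calc)))
    return z_calc, z_crit, p_value, abs(z_calc) > z_crit

# Hypothetical numbers for illustration only (not from the workbook):
# an observed difference of 4.0, a null value of 0, and a standard error of 1.6
z_calc, z_crit, p_value, reject = z_test(4.0, 0.0, 1.6)

print(f"z_calc = {z_calc:.2f}, z_crit = {z_crit:.2f}, p = {p_value:.4f}")
print("Reject the null" if reject else "Fail to reject the null")
```

Comparing the test statistic with the critical value and comparing the p-value with alpha always lead to the same decision, which is why either rule can be used.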
4. Interpret the results
The distribution tables that you will need in order to interpret results when conducting tests by hand are included at the end of this workbook. They include the Student's t table, the standard normal z distribution, and the chi-square distribution tables. Tables needed to complete the exercises presented in this workbook are included in Appendix 2.
Confidence Interval Around a Mean

The sample mean ($\bar{x}$) estimates the population mean ($\mu$) but supplies no information on the variability of, or our confidence in, the estimate. For this reason, we use confidence intervals.

The interval estimate makes use of the Central Limit Theorem and the z score. We first determine how confident we want to be in our estimate. The most common level of confidence is 95%. As we learned with the Empirical Rule, a feature of the normal curve is that 95% of the values will be within two standard deviations of the mean. This value of 2 is rounded up from the more precise value of 1.96. Thus the probability (P) that z falls between -1.96 and +1.96 is 0.95, or 95%.

$$P(-1.96 \le z \le 1.96) = 0.95$$

If we substitute our formula, $(\bar{x} - \mu)/(\sigma/\sqrt{n})$, for z, we get

$$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$

After some algebra, we end up with the formula for the 95% confidence interval around the mean as:

$$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

The probability that the population mean lies within plus or minus 1.96 standard errors of our sample mean is 95%. The multiplier 1.96 was chosen from the standard z table with an alpha of 0.05. If, for example, we wanted to calculate a 99% confidence interval, we would use the z score that corresponds with an alpha of 0.01. (Note that it is the standard error of the mean that we are multiplying by the z score.)

Overview

The confidence interval of the mean gives the range of plausible values for the true population mean.

95% of the time, the population mean will be within approximately two standard errors of the sample mean.

Formula:

$$95\%\ CI = \left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

Thus, the 95% confidence interval is:

$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$
Step-by-Step Example: Confidence Interval Around a Mean
You want to determine the mean blood pressure among government employees. In order to do this, you measure the blood pressure of 200 employees. Use the descriptive statistics below to determine a 95% confidence interval around the mean.

n = 200     $\bar{x}$ = 127 mmHg     s = 13

Step 1. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}} = \frac{13}{\sqrt{200}} = 0.92$$

Step 2. Find the lower limit of the 95% confidence interval.
95% LL = $\bar{x}$ - 1.96(SE)
95% LL = 127 - 1.96(0.92) = 127 - 1.80 = 125.2

Step 3. Find the upper limit of the 95% confidence interval.
95% UL = $\bar{x}$ + 1.96(SE)
95% UL = 127 + 1.96(0.92) = 127 + 1.80 = 128.8

Step 4. Interpret the 95% confidence interval.
The 95% confidence interval is (125.2, 128.8). This means that with repeated random sampling, 95% of the intervals constructed in this way would contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and that the true mean of the population ($\mu$) is between 125.2 and 128.8.
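The four steps translate directly into a short calculation. The sketch below simply mirrors the hand calculation with the fixed multiplier 1.96; the variable names are illustrative.

```python
import math

n, x_bar, s = 200, 127, 13   # sample size, sample mean (mmHg), sample SD

se = s / math.sqrt(n)        # Step 1: standard error of the mean
margin = 1.96 * se           # 1.96 standard errors
lower = x_bar - margin       # Step 2: lower limit
upper = x_bar + margin       # Step 3: upper limit

print(f"SE = {se:.2f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f})")
```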
Practice: Confidence Interval Around a Mean
You record gestational age at birth for live births in the past month at three primary health facilities in the region. Calculate a 95% confidence interval around the mean.

n = 350     $\bar{x}$ = 37.5 weeks     s = 12.2

Step 1. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}}$$

Practice Space:

Step 2. Find the lower limit of the 95% confidence interval.
95% LL = $\bar{x}$ - 1.96(SE)
Practice Space:

Step 3. Find the upper limit of the 95% confidence interval.
95% UL = $\bar{x}$ + 1.96(SE)
Practice Space:

Step 4. Interpret the 95% confidence interval.
Practice Space:
OpenEpi Example: Confidence Interval Around a Mean
Using the same blood pressure data as before, use OpenEpi to calculate a 95% confidence interval around the mean.

n = 200     $\bar{x}$ = 127 mmHg     s = 13

Step 1. Open the OpenEpi application.
From the OpenEpi menu choose Mean CI under the heading Continuous Variables.

Step 2. Enter the descriptive statistics as prompted.
Click on Enter New Data.
The data entry screen will open up.
Use the given information to fill in the boxes.
Notice that you only need to provide either the standard deviation, the standard error, or the variance. You do not need to provide all three. Since the standard deviation is given, this is the statistic that we will use.
Because our population is large and unknown, we can use the default number, 999999999, to represent the population size. If you have a known population, specify that number here.
Step 3. Calculate the 95% confidence interval.
Click on the button labeled Calculate.
A pop-up will open displaying the results of the calculation. Note that you must set your browser to allow pop-ups in order to view the results.

Step 4. Interpret the results.
Choose the 95% confidence interval corresponding with the t test, since we do not know the variance of the population, only the standard deviation of the sample.
The 95% confidence interval is (125.2, 128.8).
With repeated random sampling, 95% of the intervals constructed in this way would contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and that the true mean of the population ($\mu$) is between 125.2 and 128.8.
Excel Example: Confidence Interval Around a Mean
We can find a confidence interval around a mean using descriptive statistics in Excel as well. Use the same blood pressure data that we used in the previous example.

Step 1. Select the confidence interval function in Excel.
In a blank worksheet, choose Insert from the toolbar. From the drop-down menu, select Function.
Type "confidence interval" in the box labeled Search for a function. The function for confidence intervals, CONFIDENCE, will appear as your only option. Alternatively, you can scroll down the list of functions until you find the one labeled CONFIDENCE.
Click on OK.
Step 2. Enter the descriptive statistics.
You will be prompted to enter the alpha, standard deviation, and sample size. Since we are calculating a 95% confidence interval, $\alpha$ = 1.00 - 0.95 and is, therefore, 0.05.
Click on OK.
The result will then be displayed on the worksheet in the cell marked by your cursor.
The result is the equivalent of z(SE).
Step 3. Calculate the 95% confidence interval.
The result, 1.80, is the margin of error. Therefore, we can calculate the 95% confidence interval by subtracting 1.80 from and adding 1.80 to our sample mean of 127.
95% LL = 127 - 1.80 = 125.2
95% UL = 127 + 1.80 = 128.8

Step 4. Interpret your results.
The 95% confidence interval is (125.2, 128.8). This means that with repeated random sampling, 95% of the intervals constructed in this way would contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and that the true mean of the population ($\mu$) is between 125.2 and 128.8.
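The quantity that Excel's CONFIDENCE function returns, z(SE), can be reproduced outside Excel. The sketch below is a Python analogue, not Excel's own code: `confidence_margin` is a hypothetical name, its three arguments mirror the alpha, standard deviation, and sample size that Excel prompts for, and it assumes a normal-based (z) multiplier. It requires Python 3.8 or later.

```python
import math
from statistics import NormalDist

def confidence_margin(alpha, standard_dev, size):
    """Margin of error z * (standard_dev / sqrt(size)) for a (1 - alpha) interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. about 1.96 when alpha = 0.05
    return z * standard_dev / math.sqrt(size)

x_bar = 127
margin_95 = confidence_margin(0.05, 13, 200)
margin_99 = confidence_margin(0.01, 13, 200)

print(f"95% CI = ({x_bar - margin_95:.1f}, {x_bar + margin_95:.1f})")
print(f"99% CI = ({x_bar - margin_99:.1f}, {x_bar + margin_99:.1f})")
```

Changing alpha shows how the interval widens as the confidence level increases: the 99% interval is wider than the 95% interval for the same data.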
You can also use Excel to find the confidence interval around the mean if you are given a data set instead of descriptive statistics.

Excel Example: Confidence Interval Around a Mean
For this example, we will use the data set Sit/Lie. Calculate a 95% confidence interval around the mean for the variable Sitting.

1. Import the data set into Excel.
Import the data set, two sample t, by using the directions in the box below.

To open a data set in Excel:
Choose the heading Data from the toolbar.
Click on Import External Data.
Click on Import Data.
Open the folder where you have stored the database.
Choose the table that you will be working from.
Click OK.
Choose where you would like to put the data by selecting a cell of the current worksheet or selecting a new worksheet.
Click OK.
2. Calculate the 95% confidence interval using Excel.
Choose Tools from the toolbar. Select Data Analysis from the dropdown box. Highlight Descriptive Statistics and click OK. The Descriptive Statistics dialogue box will open.
Click on the chart icon next to the text box marked Input Range.
Highlight the column for the variable Sitting by clicking on the letter which corresponds with the column.
Click on the chart icon in the box labeled Descriptive Statistics to return to the dialogue box.
Check the box next to Labels in First Row.
Next, choose your output options. A new worksheet is chosen as the default, but if you would like your output to appear on the same worksheet as your data set, select the first option under Output options, Output Range. Click on the icon next to the text box. Choose the area where you would like your output to appear by clicking on a cell. Click on the icon again to return to the dialogue box.
Check the boxes next to Summary statistics and Confidence level for Mean.
Click OK.
3. Use the output to calculate the confidence interval.
Notice that the output does not actually provide you with a confidence interval. Instead, you are given a number which represents the distance from the mean to each limit of the interval. In this output, the mean is 80.95 and that number is 14.13. To find the confidence interval around the mean, subtract this number from and add this number to the mean.
95% CI = $\bar{x}$ ± confidence level = (80.95 − 14.13, 80.95 + 14.13) = (66.82, 95.08)

4. Interpret the results.
The 95% confidence interval around the mean is (66.82, 95.08). With repeated random sampling, 95% of the intervals calculated this way will contain the true mean of the population ($\mu$). We are, therefore, 95% confident that this is one of those intervals and that $\mu$ is between 66.82 and 95.08.
Excel or OpenEpi Practice: Confidence Interval Around a Mean
Using the data from the HIV Knowledge pretest, calculate the 95% confidence interval around the mean score for females in either Excel or OpenEpi.

Pretest Scores: HIV Knowledge
        Females   Males
Mean    60        40
SD      12        10
N       138       97

For additional practice, calculate the 95% confidence interval around the mean score for males by using the computer application that you did not previously use.
Practice Space
1. Open the appropriate application.
2. Enter the descriptive statistics.
3. Calculate the 95% confidence interval.
4. Interpret your results.

Related Concepts
Confidence Interval Around a Proportion
Confidence Interval: Two Sample t Test
Confidence Interval Around a Proportion

The Central Limit Theorem also applies when considering a distribution of sample proportions, when the sample size is large enough. The sampling distribution would be constructed similarly as for the mean. However, the characteristics of the sampling distribution will be different, as this is a binomial distribution. We will be estimating the population proportion rather than the population mean. Since the binomial distribution is a sampling distribution for p, its mean equals the population proportion and its standard deviation represents the standard error (SE).

n = sample size or number of trials
p = probability of success
1 − p = probability of failure

$$\text{SE of the proportion} = \sqrt{\frac{p(1-p)}{n}}$$

As the sample size, n, increases, the binomial distribution becomes very close to a normal distribution due to the central limit theorem.
Therefore, the normal distribution can be used to calculate confidence intervals and do hypothesis tests.
If np and n(1 − p) are both equal to 10 or more, then the normal approximation may be used.

Similar to the method used to calculate a confidence interval around a mean, to calculate the 95% confidence interval around a proportion, we first calculate the standard error of the proportion and then use the same formula:

$$95\%\ \text{CI} = p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$
Overview
The confidence interval around a proportion gives the range of plausible values for the true population proportion.
95% of the time, the population proportion will be within approximately two standard errors of the sample proportion.
Formula:
$$95\%\ \text{CI} = \left(p - 1.96\sqrt{\frac{p(1-p)}{n}},\ \ p + 1.96\sqrt{\frac{p(1-p)}{n}}\right)$$
Step by Step Example: Confidence Interval Around a Proportion
Out of 212 pregnant women tested for HIV, 53 had positive results. Use this information to find a 95% confidence interval for the population.

1. Identify p and 1 − p.
p, the proportion of successes = 53/212 = 0.25
1 − p, the proportion of failures = 1 − 0.25 = 0.75

2. Calculate the 95% lower limit.
$$95\%\ \text{LL} = p - 1.96\sqrt{\frac{p(1-p)}{n}}$$
$$95\%\ \text{LL} = 0.25 - 1.96\sqrt{\frac{0.25(0.75)}{212}} = 0.25 - 1.96\sqrt{\frac{0.1875}{212}} = 0.25 - 1.96\sqrt{0.00088} = 0.25 - (1.96 \times 0.0297) = 0.25 - 0.0583 = 0.1917$$

3. Calculate the 95% upper limit.
$$95\%\ \text{UL} = p + 1.96\sqrt{\frac{p(1-p)}{n}}$$
$$95\%\ \text{UL} = 0.25 + 1.96\sqrt{\frac{0.25(0.75)}{212}} = 0.25 + 0.0583 = 0.3083$$

4. Interpret the interval.
The 95% confidence interval is (0.19, 0.31). With repeated random sampling, 95% of intervals calculated will contain the true proportion of the population. We are 95% confident that this is one of those intervals and the prevalence of HIV in the population is between 19% and 31%.

Note: You will see (1 − p) referred to as q later in this workbook, as well as in many biostatistics texts.
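
The hand calculation above can be reproduced in a few lines of Python. This is a minimal sketch of the normal approximation only. The function name `wald_ci` is a hypothetical label, and raising an error when np or n(1 − p) falls below 10 is one possible way to apply the rule of thumb, not a requirement stated in this workbook.

```python
from math import sqrt


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% CI for a proportion by the normal approximation: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    # Rule of thumb from the text: np and n(1-p) should both be at least 10.
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("normal approximation not recommended: np or n(1-p) is below 10")
    se = sqrt(p * (1 - p) / n)  # standard error of the proportion
    return (p - z * se, p + z * se)


# 53 positive results out of 212 women tested.
lower, upper = wald_ci(53, 212)
print(round(lower, 4), round(upper, 4))  # 0.1917 0.3083
```

Rounded to two decimal places, these limits give the interval (0.19, 0.31).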
Practice: Confidence Interval Around a Proportion
Upon testing 250 confirmed AIDS cases, you find that 116 are positive for tuberculosis. Find the 95% confidence interval around the proportion of AIDS patients infected with TB.

Practice Space
1. Identify p and 1 − p.
2. Calculate the 95% lower limit.
$$95\%\ \text{LL} = p - 1.96\sqrt{\frac{p(1-p)}{n}}$$
3. Calculate the 95% upper limit.
$$95\%\ \text{UL} = p + 1.96\sqrt{\frac{p(1-p)}{n}}$$
4. Interpret the interval.
OpenEpi Example: Confidence Interval Around a Proportion
Using the previous example, we will demonstrate how to calculate a 95% confidence interval around a proportion. Out of 212 pregnant women tested for HIV, 53 had positive results. Use this information to find a 95% confidence interval for the population in OpenEpi.

1. Open the OpenEpi application.
From the OpenEpi menu, choose Proportion under the heading Counts.

2. Enter the proportion data as prompted.
Click on Enter New Data.
A data entry screen will open.
Use the given information to fill in the boxes. The numerator is always the number of successes (here, the 53 positive results). The denominator is the size of the population or sample (here, 212).

3. Calculate the 95% confidence interval.
Click on the button labeled Calculate.
A popup will open displaying the results of the calculation. Note that you must set your browser to allow popups in order to view the results.
4. Interpret the results.
OpenEpi calculates the 95% confidence interval by using several different methods. Though the editors recommend looking at the Mid-P Exact first, it is the Wald (Normal Approximation) that corresponds most closely with our hand calculations.
The 95% confidence interval is (0.19, 0.31). With repeated random sampling, 95% of intervals calculated will contain the true proportion of the population. We are 95% confident that this is one of those intervals and the prevalence of HIV in the population is between 19% and 31%.
OpenEpi Practice: Confidence Interval Around a Proportion
There has been a meningitis outbreak. You find that in one school, three students out of an enrolled 400 have been infected with meningitis. Use OpenEpi to calculate a 95% confidence interval.

Practice Space
1. Open the OpenEpi application.
2. Enter the proportion data as prompted.
3. Calculate the 95% confidence interval.
4. Interpret the results.

Related Concepts
Confidence Interval: z test of Proportions
Hypothesis Testing: Two Sample t test

Used for continuous data, the t test is one of the most commonly used statistical tests performed in the public health and clinical literature. Hypothesis testing using the t test allows us to determine whether the observed difference between the mean values of two groups is statistically significant.

Overview
Test employed to evaluate the null hypothesis ($H_0$) that the population means are equal versus the alternative hypothesis ($H_a$) that the population means are different. This test is used to compare the means of two independent samples.
Example: Comparing the difference in mean blood pressure for a sample of refugees to that of a sample of host country residents.
Formula:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$
Where:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
and is referred to as the pooled variance.
Assumptions:
o Two independent random samples
o Normally distributed population
o Equal, but unknown variances in the two samples (Note: There is a method to compare two samples with unequal variances called Satterthwaite's method. Please refer to a biostatistics text for further explanation.)
Type of variables: Continuous
Decision rule: If the calculated value of t ($t_{calc}$) is greater than the critical value of t ($t_{crit}$), then we can reject the null hypothesis.
Table used: Student's t table
A vital component used in the calculation of the standard error for the two sample t test is the pooled variance, denoted $s_p^2$. As indicated above, a major assumption necessary for the validity of the two sample t test is that the variances are unknown, but assumed to be equal. We can check this assumption by dividing the variance of one sample by the variance of the second sample ($s_1^2 / s_2^2$), placing the larger variance in the numerator. If $s_1^2 / s_2^2$ is less than three, assume that the variances are approximately equal. The closer this value is to one, the more nearly equal the variances are. When this assumption is justified, a pooled estimate of the common variance ($s_p^2$) can be calculated, which estimates the overall variance of the entire study population.

The pooled estimate is obtained by computing the weighted average of the two sample variances. The sample variances ($s_1^2$ and $s_2^2$) are weighted according to the number of observations in each. If the sample sizes are equal ($n_1 = n_2$), this weighted average is the mean of the two sample variances. If the two groups are of unequal size ($n_1 \neq n_2$), the pooled variance is calculated as follows:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

Our test statistic follows Student's t distribution with $n_1 + n_2 - 2$ degrees of freedom.
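
The statement about equal sample sizes can be checked directly from the pooled variance formula. As a short worked derivation, write the common sample size as $n$ (a symbol introduced here only for this illustration) and substitute $n_1 = n_2 = n$:

```latex
s_p^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{n + n - 2}
      = \frac{(n-1)\,(s_1^2 + s_2^2)}{2(n-1)}
      = \frac{s_1^2 + s_2^2}{2}
```

When the sample sizes differ, the factors $(n_1 - 1)$ and $(n_2 - 1)$ no longer cancel, so the variance of the larger sample receives more weight.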
Step by Step Example: Hypothesis Testing Two Sample t test
Can we conclude that infants born at a low income area clinic, on the average, tend to be lighter than those born at a clinic serving a high income population area? Within the past month, a student has collected data on birth weights (grams) from a random sample of 80 deliveries at a high income population serving clinic (High) and 100 deliveries at a low income population serving clinic (Low). The relevant information is summarized below in the table. Let alpha equal 0.05.

Clinic             n      Mean ($\bar{x}$)   SD ($s$)
High Clinic (1)    80     2800               100
Low Clinic (2)     100    2650               82
1. State the null and alternative hypotheses.
The researcher will determine if the mean value for one group is lower than that of the other, so a one-sided test of our hypotheses is indicated.

Our null hypothesis states that the mean birth weight of babies born at the high income clinic (1) should be less than or equal to that of babies who are born at the low income clinic (2). The null hypothesis is written as:
$$H_0: \mu_1 \le \mu_2$$
The alternative hypothesis states that the mean birth weight of babies born at the high income clinic (1) is greater than that of those born at the low income clinic (2), and is written as:
$$H_a: \mu_1 > \mu_2$$
Another way of stating the hypotheses is below. Here you are stating that the difference between the two population means (D) is less than or equal to zero (null) or the difference is greater than zero (alternative).
$$H_0: \mu_1 - \mu_2 \le 0 \qquad H_a: \mu_1 - \mu_2 > 0$$

2. State the decision rule.
Using a one-sided test with an alpha value of 0.05 and $n_1 + n_2 - 2 = 178$ df, the critical value of the test statistic is 1.645. We obtain this value from the Student's t table. Note that 178 degrees of freedom is not on the table, so we approximate it by using infinity ($\infty$).
Thus, we should reject $H_0$ if $t_{calc} > 1.645$.
3. Calculate the value of the test statistic.
Computing the value of the test statistic involves several steps. The formula we will follow is
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$

a. Calculate the difference in sample means.
$$\bar{x}_d = \bar{x}_1 - \bar{x}_2$$
Begin by computing the difference in sample means:
$(\mu_1 - \mu_2)$ is assumed to be 0, the boundary value of our null hypothesis, at which there is no difference between the two population means.
$(\bar{x}_1 - \bar{x}_2)$ is computed as: $2800 - 2650 = 150$

b. Compute the value of the pooled variance.
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
The pooled variance is calculated as:
$$s_p^2 = \frac{79(100)^2 + 99(82)^2}{178} = 8177.955$$

c. Find the value for the standard error.
$$SE = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$$
This will be the denominator of the $t_{calc}$ equation. Using the pooled variance calculated above, the standard error is computed as:
$$SE = \sqrt{\frac{8177.955}{80} + \frac{8177.955}{100}} = 13.56$$

d. Determine the value of $t_{calc}$.
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$
Specifically, we are taking our calculations from parts a and c and substituting those into our formula.
$$t_{calc} = \frac{150 - 0}{13.56} = 11.06$$
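
Parts a through d can be chained together in a short script. The Python below is a sketch that works from the summary statistics in the birth weight table. The function names `pooled_variance` and `two_sample_t` are hypothetical. The critical value is taken from the standard normal distribution, on the assumption that this matches the infinite degrees of freedom row of the t table used above.

```python
from math import sqrt
from statistics import NormalDist


def pooled_variance(n1: int, s1: float, n2: int, s2: float) -> float:
    """Weighted average of the two sample variances, with n1 + n2 - 2 df."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)


def two_sample_t(mean1, s1, n1, mean2, s2, n2, null_diff=0.0):
    """Return (pooled variance, standard error, t) for the equal-variance t test."""
    sp2 = pooled_variance(n1, s1, n2, s2)
    se = sqrt(sp2 / n1 + sp2 / n2)
    t = ((mean1 - mean2) - null_diff) / se
    return sp2, se, t


# Summary statistics from the example: High clinic (1), Low clinic (2).
sp2, se, t_calc = two_sample_t(2800, 100, 80, 2650, 82, 100)
print(round(sp2, 3), round(se, 2), round(t_calc, 2))  # 8177.955 13.56 11.06

# Variance ratio check: a value below 3 suggests roughly equal variances.
print(round(100**2 / 82**2, 2))  # 1.49

# One-sided critical value at alpha = 0.05 under the normal approximation.
t_crit = NormalDist().inv_cdf(0.95)
print(round(t_crit, 3), t_calc > t_crit)  # 1.645 True
```

Keeping the pooled variance in its own function mirrors the order of the hand calculation, so each intermediate value can be compared with parts b, c, and d.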
4. State the statistical decision.
We reject $H_0$ since the value of our test statistic, $t_{calc} = 11.06$, exceeds the critical t value of 1.645. Our test statistic therefore falls in the rejection region.

5. Report the p value.
For this test, a p value
Practice Space
3. Calculate the value of the test statistic.

a. Calculate the difference in sample means.
$$\bar{x}_d = \bar{x}_1 - \bar{x}_2$$

b. Compute the value of the pooled variance.
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

c. Find the value for the standard error.
$$SE = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$$

d. Determine the value of $t_{calc}$.
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$

4. State the statistical decision.
Practice Space
5. Report the p value.
6. State the practical conclusion.

Epi Info Example: Hypothesis Testing Two Sample t test
We will use the birth weight example from the step by step example to conduct a two sample t test in Epi Info. We are determining whether infants born at a low income area clinic tend to have a lower birth weight than those born at a clinic serving an area with a high income population. For this statistical test, we will use a one-tailed analysis since we want to know specifically whether babies born at the clinic serving a low income population area, on the average, tend to be lighter than those born at the clinic serving a high income population area, and not only if the birth weights differ. Assume an $\alpha$ of 0.05.

1. State the null and alternative hypotheses.
$H_0: \mu_1 \le \mu_2$ or $\mu_1 - \mu_2 \le 0$ (Babies born in the high income area clinic weigh less than or equal to those born in a clinic serving a low income area.)
$H_a: \mu_1 > \mu_2$ or $\mu_1 - \mu_2 > 0$ (Babies born in the high income area clinic weigh more than those babies born in a clinic serving a low income area.)

2. State the decision rule.
We will choose an alpha value of 0.05 in order to compare our results with the computer program to those which we previously calculated by hand.
If $\alpha > p$, we can reject the null hypothesis.
In addition, if we know the critical t value, then if $t_{calc} > t_{crit}$, we can reject the null hypothesis.
3. Execute the two sample t test.

a. READ the database file.
Open Epi Info and choose Analyze Data.
Choose the table two_sample_t from the data set Bios_Workbook_Examples.

b. Select the MEANS command.
Use the arrow under Means of to scroll through the variables. Choose Birthweight.
Scroll down under Cross-tabulate by Value of and choose Clinic.
Click on OK.
Scroll down to find the descriptive statistics.

4. Report the p value and/or the calculated t value.
Our p value given in the output is 0.00.
We have found a t statistic of 11.05, which differs only slightly from the t statistic calculated by hand (11.06). This could be due to rounding errors that we made in our calculations.
Note that Epi Info uses an alpha value of 0.05 and a two-tailed test as defaults.
5. State the statistical decision.
Since our p value of 0.00* is less than the alpha of 0.05, we have sufficient evidence to conclude that there is a significant difference between birth weights in the two clinics.
Remember that we can find our critical t value by using the Student's t table. In this case it is 1.645 (use the total observations to find N and the total degrees of freedom). Since our calculated t is 11.0545 and is greater than 1.645, we can confirm the ability to reject the null hypothesis.

6. State the practical conclusion.
Because p
Epi Info Practice: Hypothesis Testing Two Sample t test
There was an outbreak of cholera among students in a village school. You were given a record of those infected by the school director. Of the students infected with cholera, you want to determine if there is a significant difference in the age of the infected by gender. Use the t test in Epi Info to determine if there is a significant difference (alpha = 0.05) between the mean ages of males and females infected with cholera. Use the table AgeInSchool from the data set Bios_Workbook_Examples.

Practice Space
1. State the null and alternative hypotheses.
2. State the decision rule.
3. Performatwosamp