
Biostatistics Workbook
Field Epidemiology and Lab Training Programs (FELTP)

DRAFT

Department of Health and Human Services
Centers for Disease Control and Prevention

Coordinating Office for Global Health
Office of Capacity Development and Program Coordination

Division of Epidemiology and Surveillance Capacity Development

Acknowledgements:

We thank the following for their time and efforts in developing the content of this workbook:

Donna Jones, Michael A. Joseph, Jennifer Scharff, Nadine Sunderland

Content Review:

Edmond Maes, Peter Nsubuga

Biostatistics Workbook DRAFT: Aug. 28, 2007

Table of Contents

How to Use this Workbook
Introduction to Biostatistics

Scales of Measurement
Frequency Distributions

Central Location and Dispersion
    Measures of Central Tendency
    Measures of Dispersion

Probability and the Normal Distribution
    Probability Distribution
    Normal Distribution
    Central Limit Theorem

Statistical Inference
    Confidence Interval Around a Mean
    Confidence Interval Around a Proportion
    Hypothesis Testing: Two-Sample t-test
    Confidence Interval Estimation: Two-Sample t-test
    Hypothesis Testing: z-test for Difference in Proportions
    Confidence Interval Estimation: z-test for Difference in Proportions
    Hypothesis Testing: Paired t-test
    Confidence Interval Estimation: Paired t-test
    Fisher's Exact Test
    Chi-Square Test for Independence

Confidence Intervals for Case-Control and Cohort Studies
    Confidence Intervals: Odds Ratios and Relative Risks

Sample Size
    Sample Size for Descriptive Studies
    Sample Size for Analytic Studies

Correlation and Regression Analysis
    Pearson Product-Moment Correlation Coefficient
    Simple Linear Regression
    One-Way Analysis of Variance (ANOVA)

References
Appendix 1: Answer Key
Appendix 2: Distribution Tables
    Student's t Table
    Standard Normal z
    Chi-Square Distribution
    F Distribution


How to Use this Workbook

This workbook is intended as a resource for students in introductory biostatistics courses. It provides students with step-by-step guidance through example problems calculated by hand and with readily available statistical software programs. Practice problems are given, along with an answer key, so that students are able to solidify what they have learned in their biostatistics courses.

The workbook may also be used as a reference once a student has completed a biostatistics course. Though it does not provide detailed information on the theory of biostatistical concepts, it will serve as a refresher as to what statistical test should be used in a given situation and how to do the calculations that accompany that test.

Introduction to Biostatistics

This workbook provides an overview of basic biostatistics topics including scales of measurement, central location and dispersion, normal distribution, tests of statistical inference, sample size, and correlation and regression analysis. Following each description are examples and practice problems to be completed both by hand and with the aid of a statistical computer program. These examples and practice problems will give you an opportunity to apply the concepts to situations that you may find in the field. Data sets for the practice problems are either included in the workbook or on the accompanying CD. As you complete the practice problems, you may check your work by referring to the answer key located in Appendix 1.

This workbook is meant as a supplemental text and is not intended to replace your regular biostatistics course. However, we all need a friendly reminder from time to time. For this reason, we have included definitions of commonly used terms in biostatistics for your reference.

Data: The raw material of statistics. Data generally consist of numbers obtained by measurement or by counting in a population or sample. For example, a nurse may record the temperature of patients (a measurement) or count the number of patients with a temperature above normal.

Variable: The term for a characteristic that differs among members of a population or sample, such as height. The measurement is not constant, so it is variable. Variables can be qualitative or quantitative, continuous or discrete. Random variables cannot be predicted in advance and are the most useful for statistical purposes.

Population: A collection of entities. A statistical population refers to the largest collection of entities in which we have an interest. For example, we may be interested in looking at women of reproductive age who have had one child. Therefore, our population is limited to only those women aged 15 to 45 who have one child.

Sample: Part of a population. A sample of the example population of women aged 15 to 45 with one child might consist of an estimated 25 percent of the population.

Parameter: A descriptive measure computed from the data of a population.

Statistic: A descriptive measure computed from the data of a sample. Statistics is a field which examines the collection, organization, summarization, and analysis of data and draws inferences about a population through observation of a sample.

Descriptive Statistics: Methods for presenting and summarizing data. Descriptive statistics allow us to understand general patterns in a large quantity of data without conducting a formal test of a hypothesis.

Inferential Statistics: Statistics used to reach a conclusion about a population based on information gathered from a sample of that population. Involves estimation or hypothesis testing.

Statistical Symbols

$\mu$: population mean
$\sigma$: population standard deviation
$\bar{x}$: sample mean
$s$: sample standard deviation
$x_{.50}$: median

Scales of Measurement

Overview

Scales of measurement allow you to categorize data in order to provide information about the characteristic being measured.

The type of scale used in measuring data affects the type and amount of information that can be obtained. This affects how data will be treated statistically.

Recognizing the different scales of measurement and understanding their implications for analyzing data will also assist you in creating questionnaires for epidemiologic studies.

There are four commonly recognized scales of measurement for variables.

Nominal Scale
The nominal scale classifies persons or things based on a qualitative assessment of the characteristic being assessed. It neither includes information on quantity or amount nor indicates more than or less than.

Example: Gender (male or female) is a common nominal variable used in epidemiologic studies.

Example: Country telephone codes are an example of numeric variables that do not indicate more or less (country code 82 is not more than country code 37).

Ordinal Scale
The ordinal scale also classifies persons or things based on the characteristic being assessed, but it does indicate more than or less than. In this sense, it provides more information than the nominal scale. However, the ordinal scale does not indicate how much more than or less than.

Example: Rating students' performance as poor, average, good, or excellent indicates how well students perform and provides a basis for comparison. However, it does not indicate how much better an excellent performance is compared to a good one.

Interval Scale
The interval scale has the same characteristics as the ordinal scale (classifying persons or things based on the characteristic assessed and indicating more than or less than), but the interval scale also indicates how much more than or less than. What the interval scale does not do is indicate a true zero point, meaning that a value of zero does not represent an absence of the characteristic being measured. Additionally, ratios made with two numbers in the interval scale do not have meaning.

Example: Temperature measured in degrees Celsius or Fahrenheit is an interval variable in that different values can tell you how much more or less. However, there is no true zero point. A value of zero does not indicate an absence of temperature. Also, when comparing two temperatures, their ratio is not meaningful. We would not say that a 90 degree temperature is twice as hot as a 45 degree temperature.

Ratio Scale
The ratio scale includes all the characteristics of the interval scale but does indicate a true zero point.

Example: Height and weight measurements indicate how much more or less, but also have a true zero point. A weight of zero indicates an absence of weight.

Scales of Measurement: SUMMARY

Nominal
    Classifies persons or things based on a qualitative assessment
    Similar or dissimilar, but not more or less
    Can be numeric, but there is no implication of more or less

Ordinal
    Classifies persons or things based on a qualitative assessment
    More or less, but not how much more or less

Interval
    Indicates how much more or less
    Does not contain a true zero point
    Cannot create meaningful ratios of two numbers

Ratio
    Includes all the characteristics of the interval scale, but contains a true zero point

Practice: Scales of Measurement
Identify the scale described in each situation below:

1. Temperature of patients at a health facility
2. The weight of children under five at a weekly baby weighing
3. The religion of families in a village
4. The length of time spent in the hospital
5. The diagnosis of patients upon admission to the hospital

Frequency Distributions




One of the most common ways to summarize data for better understanding and clearer presentation is through a frequency distribution. A frequency distribution is a presentation of the number of times (or the frequency) that each value (or group of values) occurs in the study population.

A frequency distribution helps to give a picture of the shape of the distribution of the data. Data is unimodal if it has only one peak, bimodal if it has two peaks, and multimodal if there are more than two peaks. Measures of dispersion will help you to form a clearer picture of the distribution of the data by describing the width, or the spread, of the data. We will discuss this in more detail in the section titled Measures of Dispersion.

A frequency distribution can be displayed as a table, a bar chart, a histogram, or a frequency polygon. Each display should be clearly labeled with the frequencies. The method chosen usually depends on the type of variable being described.

Overview

Frequency distributions show how often each value for a variable occurs in a sample or population.

Example: Malaria cases may be reported as a frequency by month in order to determine the high-risk months in the year.



Categorical variables are qualitative in nature and are best displayed as a table or a bar chart.

Table
A frequency table simply shows the number of times each specific observation appears in a sample or population.

Cases of Malaria

Day          Frequency
Monday       6
Tuesday      4
Wednesday    2
Thursday     5
Friday       3
Saturday     4
Total        24

Bar chart
A bar chart, like a table, displays the number of observations for each category of a variable, but provides a better visual representation.

[Figure: bar chart "Cases of Malaria". X-axis: day (Monday to Saturday). Y-axis: Frequency (0 to 7).]



Numerical variables are quantitative in nature and are best displayed as a frequency histogram or a frequency polygon.

Frequency histogram
A frequency histogram shows the frequencies relative to each other. The width of each bar is in proportion to the class interval that it represents. Typically there are no spaces between bars in a frequency histogram, though you may see histograms drawn with spaces at times.

[Figure: histogram "Frequency of Malaria Cases in the Past Year". X-axis: Number of Cases (0, 1, 2, 3, 3+). Y-axis: People (0 to 25).]



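Frequency polygon
A frequency polygon encloses the same area under the line that a histogram displays within the bars. Each point is plotted at the midpoint of a class interval. Though a frequency polygon may look like a line graph, a frequency polygon must be closed at the ends, so the line returns to zero on both sides.

[Figure: frequency polygon "Frequency of Malaria Cases in the Past Year". X-axis: Number of Cases (0, 1, 2, 3, 3+, with an empty position at each end). Y-axis: People (0 to 25).]

Numerical variables may need to be grouped for presentation if the number of values is large or if the variable is continuous. The box titled Grouping Variables gives guidelines on how to group variables.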



Relative Frequency

Often it is useful to know the proportion of the values that fall within a specific category or group. This is obtained by dividing the number of values in that category by the total number in the sample. This is referred to as the relative frequency and is presented as a proportion (values from 0.0 to 1.0) or a percent (values from 0% to 100%).

When reporting either the frequency or the relative frequency in table or graph form, make sure that all data is clearly labeled.

Cases of Malaria

Day          Frequency   Percent   Cum. Percent
Monday       6           25.0      25.0
Tuesday      4           16.7      41.7
Wednesday    2           8.3       50.0
Thursday     5           20.8      70.8
Friday       3           12.5      83.3
Saturday     4           16.7      100.0
Total        24          100.0     100.0

In the table above, the relative frequency is presented as a percent of the whole.
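The percent and cumulative percent columns can be checked with a few lines of code. The sketch below is only an illustration in Python (the workbook itself uses Epi Info and Excel); the variable names are chosen here and do not come from the workbook. Cumulative percents are computed from cumulative counts rather than by adding rounded percents, which avoids rounding drift.

```python
from itertools import accumulate

# Daily malaria case counts, copied from the "Cases of Malaria" table.
cases = {
    "Monday": 6, "Tuesday": 4, "Wednesday": 2,
    "Thursday": 5, "Friday": 3, "Saturday": 4,
}

total = sum(cases.values())

# Relative frequency as a percent: category count divided by the total.
percent = {day: 100 * n / total for day, n in cases.items()}

# Cumulative percent: running count divided by the total.
running_counts = accumulate(cases.values())
cum_percent = {day: 100 * c / total for day, c in zip(cases, running_counts)}

for day in cases:
    print(f"{day:<10}{cases[day]:>4}{percent[day]:>8.1f}{cum_percent[day]:>8.1f}")
```

The last cumulative percent should always be 100; if it is not, a category has been missed or counted twice.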

Grouping Variables

Continuous numeric variables must often be regrouped into categories for analysis purposes. Listed below are some general guidelines to use when grouping variables:

Create class intervals that are mutually exclusive and include all data. It should be clear where one interval stops and the next one begins. No value should fall into more than one interval.

Use a large number of narrow class intervals for the initial analysis. All intervals should be the same size. You can combine intervals later if needed, but it is impossible to break intervals down further without referring back to the original data.

Use natural or meaningful groupings when possible. There are many groupings, such as five-year age intervals and body mass index (BMI) categories, which are used frequently and, therefore, have become standard. Some groupings have been established by organizations such as WHO or CDC.

Create a separate category for unknowns. This will avoid confusion when comparing subgroup observations (n) to the total number of observations (N).
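As a rough illustration of these guidelines, the Python sketch below assigns whole-number values to equal-width, mutually exclusive class intervals and keeps unknowns in their own category. The function name `class_interval` and the ages are hypothetical and were made up for this example; they are not from the workbook's data sets.

```python
from collections import Counter

def class_interval(value, width=5, start=0):
    """Return (lower, upper) limits of the equal-width interval holding an integer value."""
    lower = start + ((value - start) // width) * width
    return lower, lower + width - 1

# Hypothetical ages in years; None marks an unknown age.
ages = [3, 7, 12, 15, 19, 21, 24, None]

counts = Counter()
for age in ages:
    if age is None:
        counts["Unknown"] += 1          # separate category for unknowns
    else:
        lo, hi = class_interval(age)    # five-year intervals: 0-4, 5-9, ...
        counts[f"{lo}-{hi}"] += 1

for label, n in counts.items():
    print(label, n)
```

Because each value maps to exactly one interval, the counts (including "Unknown") always add up to the total number of observations.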



Step-by-Step Example: Frequency Distributions
Use the data below to create frequency distributions. This might represent a class of master's students. First, create a frequency table for Gender, then display the same information in a bar chart. Next, create a histogram of Number of children. Also, display this information in a frequency polygon.

Subject   Gender   Age   Number of children   Marital Status*
1         M        32    1                    M
2         M        35    0                    M
3         F        28    0                    S
4         M        45    3                    D
5         F        47    3                    M
6         F        36    2                    D
7         M        29    1                    S
8         M        31    0                    S
9         F        42    2                    D
10        F        44    2                    M
*M = married, S = single, D = divorced

Step 1. Create a frequency table.
Determine the number of observations for each category under Gender. Display this in a table.

Gender    Frequency
Female    5
Male      5

Step 2. Create a bar chart.
Display the frequency of the observations for Gender in a bar chart.

[Figure: bar chart "Gender of Participants". X-axis: Gender (Male, Female). Y-axis: Frequency (0 to 6).]

Step 3. Create a histogram.
Display the frequency of the observations for Number of children in a histogram.

[Figure: histogram "Number of Children of Participants". X-axis: Number of Children (0, 1, 2, 3). Y-axis: frequency (0 to 3.5).]

Step 4. Create a frequency polygon.
Display the frequency for Number of children as a polygon.

[Figure: frequency polygon "Number of Children of Participants". X-axis: Children (0, 1, 2, 3, with an empty position at each end). Y-axis: frequency (0 to 3.5).]

Step 5. Describe the data.
There are an equal number of men and women in the group. The frequency distribution shows that the variable Number of children is bimodal in nature. The majority of participants have either no children or two children.
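The hand tallies in this example can be double-checked in code. The following Python sketch is offered only as a cross-check (it is not part of the workbook's Epi Info or Excel instructions); it counts Gender and Number of children from the ten subjects in the table and finds the most frequent values.

```python
from collections import Counter

# (gender, number of children) for subjects 1 to 10, copied from the example table.
subjects = [
    ("M", 1), ("M", 0), ("F", 0), ("M", 3), ("F", 3),
    ("F", 2), ("M", 1), ("M", 0), ("F", 2), ("F", 2),
]

gender_freq = Counter(gender for gender, _ in subjects)
children_freq = Counter(children for _, children in subjects)

# The peak(s) of the distribution: every value that reaches the highest frequency.
highest = max(children_freq.values())
peaks = sorted(value for value, freq in children_freq.items() if freq == highest)

print("Gender:", dict(gender_freq))
print("Children:", dict(sorted(children_freq.items())))
print("Peaks:", peaks)
```

Two values share the highest frequency, which is what the description in Step 5 means by bimodal.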



Practice: Frequency Distributions
Using the following data set, create visual representations of the frequency distributions for the variables.

Subject   Gender   Age   Number of children   Marital Status
1         M        32    1                    M
2         M        35    0                    M
3         F        28    0                    S
4         M        45    3                    D
5         F        47    3                    M
6         F        36    2                    D
7         M        29    1                    S
8         M        31    0                    S
9         F        42    2                    D
10        F        44    2                    M

1. Create a frequency table for the variable Marital Status. (Include the cumulative percent.)
2. Show the same information in a bar chart.
3. Draw a frequency histogram for the variable Age. Group the ages in intervals of five before beginning.
4. Display the same information in a frequency polygon.

Space has been provided on the following pages to complete your work.

Practice Space

Step 1. Create a frequency table.

Step 2. Create a bar chart.

Step 3. Create a histogram.

Step 4. Create a frequency polygon.

Step 5. Describe the data set.



Epi Info Example: Frequency Distributions
You are attending a fictitious international conference. Demographic data was collected on the attendees. Use what you know about frequency distributions to summarize the data. First, create a table and a bar chart of the categorical variable Occupation. Then, create a histogram and a frequency polygon for the continuous numerical variable Weight_kg. The data set is called Frequency_Dist and is found in the Bios_Workbook_Examples.mdb database.

Frequency Table

Step 1. READ the data set.
Open Epi Info and choose Analyze Data.
Select READ under Data Analysis Commands.
Open Frequency_Dist in the database Bios_Workbook_Examples.mdb.

Step 2. Create a frequency table.
Select the FREQUENCIES command.
In the Frequency drop-down box, highlight the variable that you want to examine. For this example, highlight Occupation.
Click OK.

Step 3. Describe the data.
You should see a frequency table on your screen.

[Screenshot: Epi Info frequency table for Occupation.]

This table provides information on the variable Occupation by presenting frequencies and relative frequencies.



Bar Chart

Step 1. Make a frequency bar chart in Epi Info.
Choose GRAPH under Statistics.
In the Graph Type drop-down box, choose Bar (default).
In the box labeled 1st Title | 2nd Title, type Occupation of Participants. This is the title of your chart.
Under X Axis, choose Occupation as the Main Variable.
Under Y Axis, Show Value of Count (default).
Click OK.

Step 2. Describe the data.
Epi Info will display a bar chart.

[Screenshot: Epi Info bar chart "Occupation of Participants".]

Notice that the graph represents the exact numbers listed in the table created previously.

You can make a bar chart of the percentage of participants in each occupation by choosing Show Value of Count % under Y Axis.



Histogram

Step 1. Make a histogram in Epi Info.
Choose GRAPH under Statistics.
Under Graph Type, choose Histogram.
Create a title for your graph.
Choose Weight_kg as the main variable and Show Value of Count.

Notice that when you select Histogram as the Graph Type, you are given the option to create intervals. This allows you to group the variable Weight_kg without creating a new variable. Using the Intervals option makes the data easier to view. If you create a FREQUENCIES table, you can see that there are nearly 50 different weights recorded. It may not be useful to have each one listed separately.

To create intervals, look at the column marked X Axis. Type 5 in the first space under Interval. Type 45 in the space under First Value.
Click OK.

Step 2. Describe the data.
Now the graph you see will present the weight of participants in 5 kg intervals.



Epi Info Practice: Frequency Distributions
Use the data set from the fictitious conference (Frequency_Dist) once again to create frequency distributions for Height_cm and Preferred Language in Epi Info.

1. Create a frequency table of Preferred Language in Epi Info.
2. Make a frequency bar chart of Preferred Language in Epi Info.
3. Make a histogram of Height in Epi Info.

Review each of these displays and describe the data set.

Practice Space

4. Describe the data set using the frequency charts and graphs that you have created.

Excel Example: Frequency Distributions
Now use Excel to create a frequency polygon for the continuous numerical variable Weight_kg. The data set is called Frequency_Dist and is found in the Bios_Workbook_Examples.mdb database.

Step 1. Create a frequency polygon in Excel.

a. Open Excel and import the data set.

From the toolbar, select Data. Highlight Import External Data. Choose Import Data. Locate Frequency_Dist in the Bios_Workbook_Examples.mdb database. Click Open.

The data set should appear as an Excel spreadsheet.



b. Create a frequency table for Weight_kg.

Copy the variable Weight_kg by highlighting the column. Press Ctrl+C to copy. Choose a blank cell on the spreadsheet and paste the variable by pressing Ctrl+V.

In the cell next to the variable heading, type Interval. Complete the column by entering the intervals that you have chosen for the data. In this case, create intervals of 5, beginning with 45-49 and continuing until 100-104. You should anchor the intervals by including <45 and >=105. The first and last intervals should have a frequency of zero.

The next column will be titled Bin. Bin is a word used by Excel to define interval limits. In this column, we tell Excel how to read the intervals that we have created. Create the bin by typing in the highest number that should be included in each interval. The first number in the bin array tells Excel to count all observations less than or equal to that number, n. The second number, p, tells Excel to count all observations from n+1 through p. This continues down the column. The last interval is labeled >=105. Excel's FREQUENCY function counts values up to and including the last bin number and reports anything larger in one extra cell below the bins, so make sure the final bin number is at least as large as the highest weight in the data.

Weight (kg)   Intervals   Bin   Frequency
73            <45         44
67            45-49       49
75            50-54       54
87            55-59       59
...           ...         ...
              100-104     104
              >=105       105

Your final column will be called Frequency. We will let Excel calculate the frequencies for us.

Highlight the Frequency column by clicking on the first cell under the heading and dragging the mouse until the shaded area equals the length of the Bin column. Do not include the column label (Frequency) when highlighting.

Under Insert in the toolbar, choose Function. Select the function FREQUENCY. You may have to search for the function by typing the word frequency at the prompt.

Click OK.

You will see the Function Arguments box.

[Screenshot: Excel Function Arguments box for FREQUENCY, with the fields Data_array and Bins_array.]

Click on the chart icon to the right of the box labeled Data_array. Highlight all the values for the variable Weight_kg.

Click on the chart icon again to return to the function box.



Click on the chart icon to the right of the box labeled Bins_array. Highlight all the values in the Bin column. Click on the chart icon again to return to the function box.

Press Control and Shift together and hit Enter while continuing to hold the other two keys down. (DO NOT CLICK OK!)

The number of observations included in each interval will be shown in the Frequency column. You now have a frequency table. Note that there is a frequency of zero at the high end and at the low end of the weight intervals. You will need this in order to create a frequency polygon correctly.

c. Create a frequency polygon.

Using the frequency table that you just made, highlight all the values in the Frequency column.

Under Insert in the toolbar, select Chart.

Choose Chart Type: Line. The first line graph in the second row is preferred because it shows the midpoints in the graph.

Click Next.

A frequency polygon will appear.

To correctly label the polygon, choose the Series tab.

Click the chart icon next to the box labeled Category (X) axis labels.

Highlight the values in the column Intervals.

Your chart should now be labeled with the weight intervals along the X axis.

Click Next.

Choose Title to give your chart a title and label the X axis.

Click Finish.



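Step 2. Describe the data.

[Figure: frequency polygon "Weight of Conference Participants". X-axis: Weight in kg (five-kilogram intervals, ending with >=105). Y-axis: frequency (0 to 9).]

This distribution is unimodal because one peak is higher than the rest. The majority of participants' weights fall to the left of the peak. Most participants weigh less than 84 kg.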
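For readers who want to see what the bin values do, the Python sketch below imitates the counting rule of Excel's FREQUENCY function: each count covers values greater than the previous bin number and less than or equal to the current one, and one extra count at the end holds anything above the last bin. The function name `frequency` and the ten weights are hypothetical illustrations; they are not the Frequency_Dist data.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per bin; counts[i] holds values v with bins[i-1] < v <= bins[i].

    The result has one more entry than `bins`; the last entry counts
    values greater than the final bin number. `bins` must be sorted.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        # Index of the first bin number that is >= value.
        counts[bisect_left(bins, value)] += 1
    return counts

# Upper limits of the intervals <45, 45-49, ..., 100-104.
bins = list(range(44, 105, 5))

# Hypothetical weights in kg, made up for this illustration.
weights = [73, 67, 75, 87, 58, 62, 91, 70, 66, 79]

counts = frequency(weights, bins)
labels = ["<45"] + [f"{b - 4}-{b}" for b in bins[1:]] + [">=105"]
for label, n in zip(labels, counts):
    print(f"{label:<8}{n}")
```

With these made-up weights, the first and last counts are zero, which is the property the frequency polygon needs in order to close at both ends.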

Excel Practice: Frequency Distributions
Use the data set from the fictitious conference (Frequency_Dist) to create a frequency polygon for Height_cm in Excel.

1. Create a frequency polygon of Height in Excel.

Use your graph to answer the following questions.

Practice Space

2. Describe the data set using the frequency polygon.

3. How is this similar to the histogram that you created in Epi Info?

Central Location and Dispersion




Measures of central location and dispersion are generally referred to as descriptive statistics because they describe the distribution of the data set.

A frequency distribution provides a picture of the number of times that each value of a variable occurs, but it does not summarize the center or the spread of the data. In order to gain a clearer picture of how data is distributed, we will calculate:

Measures of central tendency: mean, median, and mode
Measures of dispersion: range, variance, standard deviation, and standard error

Through these measures, the data begins to take shape. When they are combined with the frequency distribution, we can visualize the distribution of the data. We obtain the number and height of the peaks in the distribution from the frequency. Measures of dispersion allow us to obtain an idea of the width, or the spread, of the distribution of the data.

Data can be either symmetric or skewed. If the distribution can be divided at its center into two halves that are very similar to each other, we can say that the data is symmetric. If one tail of a unimodal distribution is longer than the other tail, then the data is skewed, meaning that the data is not spread evenly. Data can be either right skewed or left skewed. If data is skewed to the right, it will rise quickly to a peak and have a long tail on the right. The opposite is true for data that is skewed to the left.



Measures of Central Tendency

Mean
The mean is simply the arithmetic average of the data and is calculated by taking the sum of all values in the data set and dividing that total by the number of values in the data set. The mean is the most commonly used measure of central tendency.

$$\bar{x} = \frac{\sum x}{n}$$

Median
The median is the 50th percentile of the values in a data set and represents the literal middle of the data. The median is found by arranging all values in the data set in numerical order and then choosing the middle value. If the number of values in a data set is even, take the mean of the two middle numbers to find the median.

Mode
The mode represents the value that is found most frequently in a set of numbers. Note that it is possible to have more than one mode. In the following set of numbers, {8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7}, the mode is both 8 and 6, since each is included in the data set three times. This data set is referred to as bimodal because it has two modes. It is also possible not to have a mode in a set of numbers. In the following set of numbers, {5, 4, 9, 7, 6, 3, 8}, there is no number which occurs more frequently than any other. Therefore, there is no mode.
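The two mode examples can be reproduced with a short Python sketch. The helper `modes` is a hypothetical function written for this illustration; it follows the convention used in the text, reporting no mode when every value occurs equally often, and it assumes a non-empty list.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] when all values occur equally often."""
    counts = Counter(values)
    top = max(counts.values())
    # Several distinct values that all tie means no value stands out: no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

bimodal_set = [8, 7, 8, 8, 9, 6, 5, 6, 4, 6, 7]
no_mode_set = [5, 4, 9, 7, 6, 3, 8]

print(modes(bimodal_set))   # both 6 and 8 occur three times
print(modes(no_mode_set))   # every value occurs once
```

Some software reports every value as a mode, or only the first one, when all values tie, so it is worth checking how a given program handles this case.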

Overview

Measures of central tendency are used to describe the data in the sample by giving an idea of the center and the distribution of the data.

There are three common measures of central tendency: mean, median, and mode.

Formula: For instance, the arithmetic mean is calculated as follows:

$$\bar{x} = \frac{\sum x}{n}$$



Comparison of mean, median, and mode
When you are told to average the data, it is generally expected that you will take the mean. Technically, however, the average could refer to the mean, the median, or the mode of the data. The mean is able to give us the most information about the data set as a whole, especially when combined with the standard deviation. Therefore, we prefer to use the mean when we can.

There are certain advantages to the median. The median is resistant to skewing, which occurs when an outlier pulls the mean of the data either to the left or to the right. Unlike the mean, the median is not affected by extreme values, and it is more representative of the center of the data when the data is asymmetrical.

Let's consider skewed data. Look at the graph of the population distribution by state in the United States.

[Figure: bar chart "Population of the United States by State". X-axis: State (the states and the District of Columbia, ordered from most to least populous, California to Wyoming). Y-axis: Population (0 to 40,000,000). The mean and the median are marked on the graph.]

The states appearing on the left side of the graph have a significantly larger population than other states. Because of this, we expect the mean to be higher in value than the median. The calculated mean in this sample is 5,811,968.706, which is marked on the graph above. The median is 4,173,405, also marked on the graph. The mean in this example is greater than the median. A general rule to follow is that if the data is skewed either to the left or to the right, the median represents the data better than the mean. If a sample is normally distributed, the mean and median will be nearly the same. With symmetrical data, the mode will be similar as well.
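The pull of a few large values on the mean can be seen in a small, made-up example. The Python sketch below uses hypothetical populations in millions (not the actual state figures) with a long right tail and compares the mean with the median.

```python
import statistics

# Hypothetical populations in millions, invented to show right skew.
populations = [0.6, 0.9, 1.3, 2.9, 4.2, 5.8, 12.7, 19.5, 36.1]

mean_pop = statistics.mean(populations)
median_pop = statistics.median(populations)

print(f"mean   = {mean_pop:.1f} million")
print(f"median = {median_pop:.1f} million")
```

Here the three largest values lift the mean well above the median, while the median stays at the middle observation.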



When the sample size is small, the mode may represent the data most accurately. It is possible that in bimodal data, the modes will be a more accurate description as well. The mode is also frequently used to describe qualitative data. For example, you might find a modal diagnosis, or use the mode to describe medical diagnoses by stating the diagnosis that was seen most frequently over a given period of time.

Step-by-Step Example: Mean, Median, Mode
The following are ages of patients seen by the doctor for a broken bone in the past month:

15 17 20 14 16 15 17 22 18 13 15 14 16 18 20

Use the data to answer the following questions:

What is the mean age of the patients?
What is the median age of the patients?
What is the modal age of the patients?
Which measure is the most representative of the sample?

Step 1. Find the mean, $\bar{x}$, of the sample.

$$\bar{x} = \frac{\sum x}{n} = \frac{15 + 20 + 18 + 16 + 14 + 15 + 13 + 18 + 22 + 17 + 15 + 16 + 14 + 20 + 17}{15} = \frac{250}{15} = 16.7$$

Step 2. Find the median of the sample.

First line the numbers up in numerical order:
13 14 14 15 15 15 16 16 17 17 18 18 20 20 22

Find the middle number (shown in brackets):
13 14 14 15 15 15 16 [16] 17 17 18 18 20 20 22

There are 7 numbers on either side of the bracketed value, thus 16 is the median.

Step 3. Find the mode of the sample.

13 14 14 15 15 15 16 16 17 17 18 18 20 20 22

The number that appears most, at three times, in this data set is 15. Therefore, 15 is the mode.



Step 4. Which statistic is most representative of the center of the data set?

In this case, the mean and the median are nearly equal. This suggests that the distribution is roughly symmetric, as a normal distribution would be, and that the mean represents the center of the data. If the mean and the median are different, we can assume that the data is skewed and the median will generally be more appropriate.
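The hand calculations in Steps 1 to 3 can be verified with Python's standard `statistics` module. This is only a cross-check sketch; the workbook's own software instructions use Epi Info.

```python
import statistics

# Ages of the 15 patients from the step-by-step example.
ages = [15, 17, 20, 14, 16, 15, 17, 22, 18, 13, 15, 14, 16, 18, 20]

mean_age = statistics.mean(ages)      # sum of the values divided by n
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value

print(f"n = {len(ages)}, sum = {sum(ages)}")
print(f"mean = {mean_age:.1f}, median = {median_age}, mode = {mode_age}")
```

The results agree with the values found by hand: a mean of about 16.7, a median of 16, and a mode of 15.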

Practice: Mean, Median, Mode
In order to determine if there is a relationship between age and the number of visits to the doctor, you decide to count the number of doctor visits that individuals make over the course of a year. Below is the data that you have collected:

Individual   Age   Visits
1            45    15
2            60    8
3            52    22
4            46    9
5            23    2
6            52    15
7            37    3
8            33    13

Describe the average age of your sample and the average number of doctor visits made by an individual using the mean, median, and mode.

Practice Space

Step 1. Find the mean, $\bar{x}$.

$$\bar{x} = \frac{\sum x}{n}$$

Step 2. Find the median.

Step 3. Find the mode.

Step 4. Which statistic is most representative of the center of the data set and why?

Epi Info Example: Mean, Median, Mode
Using the same data that we worked with in the step-by-step example above, we can find the mean, median, and mode in two simple steps using Epi Info.

Step 1. Use Epi Info to determine descriptive statistics.

a. READ the data set.
Open Epi Info and choose Analyze Data.
Select READ in Data Analysis Commands.
Highlight Central_Tendency from the Data Source Bios_Workbook_Examples.
Click OK.

b. Find the MEANS of the data.
Select MEANS from the Commands column under Statistics.
Choose Age from the drop-down box under Means of.
Click OK.

Step 2. Identify the mean, median, and mode of the data.

[Screenshot: Epi Info MEANS output for Age.]

The output gives you the mean, the median, and the mode. Epi Info reports the mean to be 16.7, the median to be 16.0, and the mode to be 15.0. This does not differ from the hand calculations that we performed previously.

Step 3. Interpret the results.

As we determined earlier, the mean and the median are nearly equal. This suggests that the distribution is roughly symmetric, as a normal distribution would be, and that the mean represents the center of the data. If the mean and the median are different, we can assume that the data is skewed and the median will generally be more appropriate.

EpiInfoPractice:Mean,Median,ModeYouareweighingbabiesfrom9AMto11AMatanunderfiveclinicinthevillage.Yourresultsareasfollows:

Age(months)

Length(cm)

Weight(kg)

21 77 9.834 87 11.523 84 10.830 92 14.027 85 12.024 82 10.831 87 11.626 85 11.822 85 12.432 86 12.0

UseEpiInfotofindthemean,median,andmode. Then,answerthequestionsthatfollow. ThedatasetyouareworkingfromiscalledBabyWeighing.RemembertoopenthedatasetinEpiInfobyusingtheREADcommand.



Step 1. Identify the mean, median, and mode of the data.

Length:            Weight:
Mean ______        Mean ______
Median _____       Median _____
Mode ______        Mode ______

Step 2. What is the average length and weight of babies that came into the clinic on this morning?

Step 3. What can you determine about the distribution of the data based on your results?

Related Concepts

Measures of Dispersion
Normal Distribution

Measures of Dispersion

In the previous section, we discussed methods of describing the center of the data. Now we want to examine ways to describe the spread of the data, or how far each data point is from the center.

Range: The range of the data is the difference between the largest observation (maximum value) and the smallest observation (minimum value) in a set of data.

range = maximum - minimum

Interquartile Range (IQR): The interquartile range is the difference between the 75th percentile (3rd quartile, Q3) and the 25th percentile (1st quartile, Q1) in a set of data. This measurement describes the spread of the middle 50 percent of the observations and is, therefore, less likely to be influenced by outliers or extreme values. In a data set of n observations arranged in increasing order, Q1 is the value at position (n+1)/4 and Q3 is the value at position 3(n+1)/4.

$$\text{IQR} = Q_3 - Q_1, \qquad \text{position of } Q_1 = \frac{n+1}{4}, \qquad \text{position of } Q_3 = \frac{3(n+1)}{4}$$

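Overview

Measures of dispersion describe variability of data in a sample by describing the spread of the data.

Formulas:

Range = maximum - minimum

Interquartile Range = $Q_3 - Q_1$, where $Q_1$ is at position $\frac{n+1}{4}$ and $Q_3$ is at position $\frac{3(n+1)}{4}$

Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ OR $s^2 = \frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}$

Standard deviation: $s = \sqrt{s^2}$

Standard error: $SE = \frac{s}{\sqrt{n}}$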



Variance (s²): The variance represents the amount of spread or variability around the mean of a set of data. Because the variance is in units squared, we find the standard deviation to describe our data in the proper units. The symbol s² is used when we are referring to the variance of a sample and the symbol σ² (pronounced sigma squared) when we are referring to the variance of a population.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad \text{OR} \quad s^2 = \frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}$$

Standard Deviation (s): The standard deviation of a set of data is the square root of the variance. It describes the typical distance of the observations from the mean of the sample and is used as a measure of variability to describe the spread of the data. A large standard deviation represents a wide spread because the observations are far from the mean. When we refer to the standard deviation of a population, we use the symbol σ (sigma).

$$s = \sqrt{s^2}$$

Standard Error (SE): The standard error is the standard deviation of the sampling distribution of the means, rather than of the observations themselves. The smaller the standard error, the closer any given sample mean is likely to be to the true population mean.

$$SE = \frac{s}{\sqrt{n}}$$

Step by Step Example: Measures of Dispersion

Using the data below, follow the instructions to identify the measures of dispersion for Age.

Minimum, maximum, and range

Step 1. Identify the minimum value of Age.

The minimum value is the lowest value in the sample. In this case, it is 23.

Step 2. Identify the maximum value of Age.

The maximum value is the highest value in the sample. In this case, it is 60.

Step 3. Determine the range of Age.

max - min = range

60 - 23 = 37

37 is the range of the sample.

Step 4. State your conclusions.

The observations in Age cover a range of 37 years.

Interquartile Range

Step 1. Arrange observations of the variable Age in order of increasing value.

1) 23   2) 33   3) 37   4) 45   5) 46   6) 52   7) 52   8) 60

Step 2. Find the position of the 1st (Q1) and 3rd (Q3) quartiles.

$$\text{position of } Q_1 = \frac{n+1}{4} = \frac{8+1}{4} = 2.25$$

$$\text{position of } Q_3 = \frac{3(n+1)}{4} = \frac{3(8+1)}{4} = 6.75$$

Step 3. Locate each number indicated in the data set.

Q1, with a position of 2.25, is one-fourth of the way between the 2nd and 3rd observations in the set. The 2nd value is 33 and the 3rd is 37, so

$$Q_1 = 33 + \frac{1}{4}(37 - 33) = 33 + 1 = 34$$

Q3, with a position of 6.75, is three-fourths of the way between the 6th and 7th observations in the set. The 6th value is 52 and the 7th value is also 52. Therefore, Q3 = 52.

Step 4. Find the difference between Q1 and Q3 to determine the interquartile range.

Q3 - Q1 = IQR

Q1 = 34, Q3 = 52

52 - 34 = 18

The middle 50 percent of the data has a range of 18. This means that the middle half of all the observations in Age is spread across 18 years.
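The range and interquartile range steps above can be sketched in Python. This is only an illustration of the position method used in this workbook, (n+1)/4 and 3(n+1)/4 with interpolation between neighboring values; the function names are hypothetical, and the sketch assumes at least three observations. Other software may use a different quartile rule and report slightly different values.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in an ordered list."""
    whole = int(position)            # whole part of the position
    fraction = position - whole      # how far to move toward the next value
    lower = ordered[whole - 1]       # convert the 1-based position to a 0-based index
    if fraction == 0:
        return lower
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) using the (n+1)/4 and 3(n+1)/4 positions."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1


ages = [45, 60, 52, 46, 23, 52, 37, 33]   # Age values from the practice table
age_range = max(ages) - min(ages)
q1, q3, iqr = quartiles_and_iqr(ages)
print(age_range, q1, q3, iqr)
```

For the Age data this reproduces the hand calculation: a range of 37, Q1 = 34, Q3 = 52, and an IQR of 18.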

Variance, standard deviation, and standard error

Step 1. Find the mean of the data set.

1) 23   2) 33   3) 37   4) 45   5) 46   6) 52   7) 52   8) 60

$$\bar{x} = \frac{23 + 33 + 37 + 45 + 46 + 52 + 52 + 60}{8} = \frac{348}{8} = 43.5$$

Step 2. Calculate the variance using the formula below.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$s^2 = \frac{1}{8-1}\left[(23-43.5)^2 + (33-43.5)^2 + (37-43.5)^2 + (45-43.5)^2 + (46-43.5)^2 + 2(52-43.5)^2 + (60-43.5)^2\right]$$

$$s^2 = \frac{1}{7}\left[420.25 + 110.25 + 42.25 + 2.25 + 6.25 + 2(72.25) + 272.25\right]$$

$$s^2 = \frac{998}{7}$$

$$s^2 = 142.57$$

Step 3. Calculate the standard deviation.

$$s = \sqrt{s^2} = \sqrt{142.57}$$

s = 11.94

Step 4. Calculate the standard error of the means.

$$SE = \frac{s}{\sqrt{n}} = \frac{11.94}{\sqrt{8}}$$

SE = 4.22

Step 5. State your conclusions.

The observations are, on average, about 11.94 years away from the mean. If we were to take many samples of the same size from the same population, the sample means would typically differ from the actual population mean by about 4.22 years.
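As a cross-check on the hand calculation, the same quantities can be computed with Python's standard `statistics` module. This is a minimal sketch, not part of the workbook's Epi Info workflow; `statistics.variance` and `statistics.stdev` use the sample (n - 1) formulas shown above.

```python
import math
import statistics

ages = [23, 33, 37, 45, 46, 52, 52, 60]

mean_age = statistics.mean(ages)            # sum of the values divided by n
variance = statistics.variance(ages)        # sample variance, divides by n - 1
std_dev = statistics.stdev(ages)            # square root of the sample variance
std_error = std_dev / math.sqrt(len(ages))  # SE = s / sqrt(n)

print(round(mean_age, 1), round(variance, 2), round(std_dev, 2), round(std_error, 2))
```

Rounded to two decimals, the results agree with the worked steps: a mean of 43.5, a variance of 142.57, a standard deviation of 11.94, and a standard error of 4.22.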



Practice: Measures of Dispersion

Use the same data set to describe the dispersion of the observations of the variable Visits.

Minimum, maximum, and range

Step 1. Identify the minimum value of Visits.

Step 2. Identify the maximum value of Visits.

Step 3. Determine the range of Visits.

max - min = range

Interquartile Range

Step 1. Arrange observations of the variable Visits in order of increasing value.

Step 2. Find the position of the 1st (Q1) and 3rd (Q3) quartiles.

$$\text{position of } Q_1 = \frac{n+1}{4} \qquad \text{position of } Q_3 = \frac{3(n+1)}{4}$$

Step 3. Locate each number indicated in the data set.

Step 4. Find the difference between Q1 and Q3 to determine the interquartile range.

Q3 - Q1 = IQR

Variance, standard deviation, and standard error

Step 1. Find the mean of the variable Visits.

Step 2. Calculate the variance using the formula below.

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Step 3. Calculate the standard deviation.

$$s = \sqrt{s^2}$$

Step 4. Calculate the standard error of the means.

$$SE = \frac{s}{\sqrt{n}}$$




Epi Info Example: Measures of Dispersion

Use the table below (data set BabyWeighing) to find measures of dispersion for the variable Age in Epi Info. First find the maximum, minimum, range, and interquartile range. Then calculate the variance, the standard deviation, and the standard error.

Age (months)   Length (cm)   Weight (kg)
21             77            9.8
34             87            11.5
23             84            10.8
30             92            14.0
27             85            12.0
24             82            10.8
31             87            11.6
26             85            11.8
22             85            12.4
32             86            12.0

Step 1. READ the data set in Epi Info.

Open Epi Info and choose Analyze Data.

Select READ and open the database, Bios_Workbook_Examples. Choose the data set BabyWeighing.

Click OK.

Step 2. Find the MEANS of the data set.

Select MEANS under the Statistics heading.

In the drop-down menu for Means Of, choose Age_in_months.

Click OK.

Step 3. Use the output to determine the range and the interquartile range.

The output provides you with the maximum and the minimum in the data. Find the difference to determine the range.

Range = maximum - minimum
Range = 34 - 21 = 13

The output also provides the 25th percentile, equal to Q1, and the 75th percentile, equal to Q3, so that we can determine the interquartile range.

IQR = Q3 - Q1
IQR = 31 - 23 = 8

Step 4. Use the output to identify the variance and standard deviation of the variable.

Variance = 20.67
Standard Deviation = 4.55

If we want to calculate the standard error, we simply divide the standard deviation by the square root of the number of observations:

$$SE = \frac{4.5461}{\sqrt{10}} = 1.44$$

Step 5. Describe the variable in terms of dispersion.

The range of the variable Age_in_months is 13 months. The middle half of the data spans 8 months. The typical distance of each observation from the mean of the data is 4.55 months. If we were to take many samples of the same size from the same population, the sample means would typically differ from the actual population mean by about 1.44 months.

Epi Info Practice: Measures of Dispersion

Use the same data set, BabyWeighing, to practice describing data in terms of dispersion with the help of Epi Info. Determine the range and interquartile range and identify the variance, standard deviation, and the standard error of the variable Length.

Find the MEANS of the data set in Epi Info.

Use the output to answer the following questions.

Step 1. Determine the range and the interquartile range.

Range =

IQR =

Step 2. Identify the variance and standard deviation of the variable.

s = ______

s² = ______

Step 3. Describe the variable in terms of dispersion.

Related Concepts

Normal Distribution

Probability and the Normal Distribution

Up to this point, we have focused on descriptive statistics. We have simply been organizing and summarizing data that has been collected. We also want to explore some methods for drawing conclusions about populations based solely on data that we have for a sample of that population. Because we can never be certain that our conclusions based on this sample accurately represent the target population, we refer to this as inferential statistics. Inferential statistics is based on probability theory, or the science of uncertainty. The following sections describe how probability theory allows us to make inferences about a population based on data obtained from a sample of that population.


Probability Distribution

Probability is an indicator of the likelihood that an event or condition will occur. Some describe it as the long-run relative frequency of the event in repeated trials under similar conditions. It reflects the proportion of the population with the condition or event. For example, if 40% of workers in a factory are female, the probability that a randomly selected worker will be a female is 40%. Stated another way, if we randomly select n workers, the expected number of females in the sample is n × 40%. Alternatively, the expected number of males is n × (100% - 40%), or n × 60%.

Probability can also be used to consider continuous variables (not just conditions or events as noted above). It can indicate the likelihood of a value in a particular range. For example, if 5% of men at the factory have a height over 180 cm, the probability that a randomly selected man will have a height over 180 cm is 5%.

Probability distributions represent the probability of the different outcomes (e.g. male, female) for a sample selection. The relationship between the values of a variable and the probabilities of their occurrence can be summarized in a probability distribution.

If we select a single worker from this factory, the probability distribution for the possible outcomes for gender is simple.

Possible outcome   Probability
Male               0.60
Female             0.40

Overview

A probability distribution is a distribution of data based on the likelihood that an event or indicator will occur in a sample of the population.

Knowledge of the probability distribution of a variable allows us to draw conclusions about a population based on data taken from a sample of that population.

If we select three workers, then the probability distribution becomes more complicated. An outcome that mixes males and females can occur in three different orders, so the product of the individual probabilities is multiplied by 3.

Possible outcomes    Probability
All male             0.216 = (0.60 × 0.60 × 0.60)
2 male, 1 female     0.432 = 3 × (0.60 × 0.60 × 0.40)
2 female, 1 male     0.288 = 3 × (0.40 × 0.40 × 0.60)
All female           0.064 = (0.40 × 0.40 × 0.40)
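The three-worker probabilities follow the binomial formula: the number of orderings in which an outcome can occur, times the probability of any one ordering. The sketch below is an illustration under that assumption, with a hypothetical function name; it uses `math.comb` from the standard library to count the orderings.

```python
from math import comb


def female_count_distribution(n_workers, p_female):
    """Probability of each possible number of females among n_workers selections."""
    p_male = 1 - p_female
    return {
        k: comb(n_workers, k) * p_female**k * p_male**(n_workers - k)
        for k in range(n_workers + 1)
    }


dist = female_count_distribution(3, 0.40)
for females, prob in dist.items():
    print(f"{females} female, {3 - females} male: {prob:.3f}")
```

The four probabilities are 0.216, 0.432, 0.288, and 0.064, and they sum to 1, as the probabilities of all possible outcomes must.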

There are several model or theoretical probability distributions that will allow us to determine the probability of a given value for a random variable even if we do not have (or know) the full probability distribution for that variable. These probability distributions are given or calculated by mathematical formulae called probability functions. We can apply the model to create a probability density curve where the height of the curve reflects the frequency of the individual values and the area in an interval under the curve reflects the proportion of a population in that interval. This is also a probability distribution.

Examples of probability and other distributions include the normal, binomial, Poisson, chi-square, F, and t distributions. For the sake of simplicity, the only distribution we will cover in this workbook is the normal distribution.

Related Concepts

Normal Distribution

Normal Distribution

The normal distribution is the most famous and important of the theoretical probability distributions for two main reasons. First, for many variables we encounter in the health field (e.g. height, blood pressure, hemoglobin level, etc.), it is a good description of the distribution of the variable. Second, and more importantly, the normal distribution has a central role in statistical analysis as it is used as the probability distribution of the sample means. Calculations based on the normal distribution are used to derive confidence intervals and determine p values for quantitative data, proportions, and rates.

Characteristics of a normal distribution:

It is specified by two parameters: the population mean and the standard deviation.

It is symmetrical around the mean, bell shaped, and unimodal. This is why the normal curve is frequently referred to as the bell curve.

The mean, median, and mode are all in the middle of the curve.

The total area under the curve above the x axis is one square unit, with 50% of the area to the right of the mean and 50% to the left of the mean.

According to the Empirical Rule:

The area bounded by one standard deviation to the right and one standard deviation to the left of the mean represents approximately 68% of the values.

The area bounded by two standard deviations to the right and two to the left represents approximately 95% of the values.

Approximately 99.7% of the values will be within three standard deviations of the mean.

This is demonstrated in the graph on the next page:

Overview

The normal distribution is a bell shaped curve with both the mean and the median at the center of the curve.

The standard normal distribution is a distribution of data with a mean of zero and a standard deviation of one. It allows different populations to be compared to each other.

Formula: The formula below is used to calculate the standard score, or the z score, when comparing normally distributed populations.

$$z = \frac{x - \mu}{\sigma}$$


Knowing the mean and standard deviation of a normal distribution allows one to determine the following values:

The proportion of individuals who fall into any range of values
The percentile at which a given value falls
The value which corresponds to a given percentile

Below is a frequency distribution of the height of men in the US population, characterized by a normal distribution with a mean of 171.5 cm and a standard deviation of 6.5 cm.

[Figure: normal curve of male height, centered at μ = 171.5 cm]

Given that the mean height of the men in the US is 171.5 cm (μ = 171.5 cm) and the standard deviation is 6.5 cm (σ = 6.5 cm) and using our knowledge of the normal curve, we know the following information:

68.3% of men are between 165 and 178 cm (μ ± 1σ = 171.5 ± 6.5)
95.5% of men are between 158.5 and 184.5 cm (μ ± 2σ = 171.5 ± 2 × 6.5)
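These percentages can be checked against the normal curve directly. The sketch below assumes the stated mean and standard deviation and uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

heights = NormalDist(mu=171.5, sigma=6.5)   # male height model from the text


def share_within(dist, k):
    """Proportion of the distribution within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    print(f"within {k} SD ({low} to {high} cm): {share_within(heights, k):.1%}")
```

The computed shares are close to the 68%, 95%, and 99.7% of the Empirical Rule; small differences in the last digit come from rounding.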

What if we want to know specific information such as:

What proportion of men are over 180 cm?
What height value is at the 10th percentile?

Statisticians have devised a method to transform all normal distributions so that they use the same scale. This is known as the standard normal distribution. The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. A normal distribution can be compared with other normal distributions by converting it to a standard normal distribution using the formula shown below. The standard normal distribution specifies how far an individual value is from the mean in units of the standard deviation, which allows us to calculate a standard score. The standard score is a way of expressing an individual value in terms of standard deviation units. The standard score, referred to as the z score, is calculated as (observed value - mean) divided by the standard deviation. The formula is below:

$$z = \frac{x - \mu}{\sigma}$$

The z score will also be referred to as a test statistic. Each distribution has a corresponding test statistic. The z score corresponds with the standard normal distribution.


Example: Using the Standard Normal Distribution

Given a normal distribution of male heights with μ = 171.5 cm and σ = 6.5 cm, what is the proportion of men taller than 180 cm?

$$z = \frac{x - \mu}{\sigma} = \frac{180 - 171.5}{6.5}$$

$$z = \frac{8.5}{6.5} = 1.31$$

Now that we know the z score, we must find the area of the standard normal curve above 1.31.

In order to find the area of the curve that is represented by the z score, 1.31, we must refer to the standard normal z distribution located in Appendix 2.

On the Standard Normal z Table, locate the z score 1.31. Under the column labeled z, find the value 1.3. The row labeled z will provide you with the hundredths place of your z score, so follow it over until 0.01. If you place one finger on 1.3 and one finger on 0.01 and follow those paths until your two fingers meet, you find the value 0.9049. Use the excerpt from the Standard Normal z Table on the following page to help you locate the z score.

[Figure: standard normal curve with the area to the right of z = 1.31 marked]


This table will give us the area of the curve located to the LEFT of the z score. As you can see by the diagram, we want to find the area of the curve located to the RIGHT of the z score. To find the area to the right of the z score, we subtract 0.9049 from 1.

1 - 0.9049 = 0.0951

Therefore, approximately 9.5% (0.0951 × 100%) of the curve is above 180 cm (or more than 1.31 SD above the mean). We can also say that men whose heights are 180 cm and above are taller than 90.5% of American men. Thus, a height of 180 cm represents approximately the 90th percentile.
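The table lookup can be reproduced in software. The sketch below assumes the same mean and standard deviation and uses `statistics.NormalDist` from the Python standard library in place of the printed z table; the z score is rounded to two decimals only to mirror the table.

```python
from statistics import NormalDist

mu, sigma = 171.5, 6.5
height = 180

z = round((height - mu) / sigma, 2)     # rounded to two decimals, as for a table lookup
area_left = NormalDist().cdf(z)         # area to the LEFT of z under the standard normal curve
area_right = 1 - area_left              # area to the RIGHT of z

print(z, round(area_left, 4), round(area_right, 4))
```

With z = 1.31 the area to the left is about 0.9049 and the area to the right about 0.0951, matching the table.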

To practice using the table for the standard normal distribution, answer the following question.

What height value is at the 10th percentile? We manipulate the formula to solve for x rather than z:

$$x = \mu + (z \times \sigma)$$

where:

x is the observed value
μ is the population mean (given)
σ is the population standard deviation (given)
z comes from the standard normal distribution


To find the answer to this problem, first look up the z score from the table in Appendix 2 which corresponds to the lowest 10% of the area beneath the curve. This area will be on the left-hand side of the curve. Do this by reversing the steps we previously used to find the area.

Locate the area closest to 0.10 in the z table. Then follow the row and column to identify the z score that it is associated with. You should find a z score of -1.28.

x = μ + (z × σ)
x = 171.5 + (-1.28 × 6.5)
x = 171.5 - 8.32 = 163.18

The 10th percentile is approximately 163.2 cm. This means that 10% of American men are 163.2 cm or shorter and 90% of American men are taller than 163.2 cm.
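The reverse lookup, from a percentile to a height, can also be sketched with `statistics.NormalDist`, whose `inv_cdf` method plays the role of reading the z table backwards. This assumes the same mean and standard deviation as above. Because software does not round z to two decimals, the result may differ from a table-based answer in the first decimal place.

```python
from statistics import NormalDist

mu, sigma = 171.5, 6.5

z_10 = NormalDist().inv_cdf(0.10)    # z score with 10% of the area to its left
x_10 = mu + z_10 * sigma             # x = mu + (z * sigma)

# The same value, read directly from the height distribution.
x_direct = NormalDist(mu=mu, sigma=sigma).inv_cdf(0.10)

print(round(z_10, 2), round(x_10, 1), round(x_direct, 1))
```

The z score is about -1.28 and the corresponding height is a little over 163 cm.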

Practice: Using the Standard Normal Distribution

You have attended an HIV/AIDS training where a pretest and a posttest were given in order to measure knowledge gained. Pretest scores are included in the table below. Use the table to answer the following questions.

Pretest Scores: HIV Knowledge

        Females   Males
Mean    60        40
SD      12        10
N       138       97

1. If a male gets a score of 70, what is his z score?
2. What is the z score for a female with a score of 35?
3. What score for females is equivalent to a male's score of 78?

Related Concepts

Central Limit Theorem

Central Limit Theorem

Not all data is normally distributed. Data that is not normally distributed requires different tests in order to properly analyze and compare it. Fortunately, if we have an adequately large sample size (n > 30), the sampling distribution of a sample statistic such as the mean tends to approach normality and we are able to treat it as normal. This concept is known as the Central Limit Theorem.

Just as we calculated the standard deviation for a distribution of individual values around a mean, we now can calculate a similar measure of variability for a series of samples from the population. This is the Standard Error of the statistic and measures the precision of the statistic (mean or proportion) as an estimate of the population mean or population proportion. It indicates the degree to which a sample statistic reflects the true population value.

The standard error is the basis for calculating confidence intervals and conducting hypothesis tests for means and proportions. This allows us to make generalizations about a larger group of individuals based on a subset or sample.

As you know, most epidemiologic studies are carried out with the aim of learning about a characteristic in a target population. It is rarely feasible to study every individual. Therefore, we usually compare exposures or disease within a sample of the population. A major role of statistics is to allow us to generalize results from a sample to the larger group and understand how accurately that generalization reflects the actual population mean (or proportion).

Overview

The sampling distribution of sample statistics (mean or proportion) will look normally distributed for large sample sizes.

Simply, if the sample size is large (typically n > 30), the distribution of sample means or sample proportions approximates a normal distribution.

Formula:

$$SE = \frac{s}{\sqrt{n}}$$

Thus, the standard error becomes smaller as n gets bigger, meaning that the larger the sample size, the closer the sample mean, $\bar{x}$, is likely to be to the population mean, μ.
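A small simulation can illustrate this. The sketch below is an assumption-laden illustration, not a workbook exercise: it draws repeated samples from a skewed population (an exponential distribution with standard deviation 1, chosen here only because it is clearly not normal), records each sample mean, and compares the spread of those means with the value predicted by SE = s/√n. The function name, the number of samples, and the seed are arbitrary choices.

```python
import math
import random
import statistics


def spread_of_sample_means(sample_size, n_samples=2000, seed=1):
    """Standard deviation of the means of many random samples of one size."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


spread_small = spread_of_sample_means(5)
spread_large = spread_of_sample_means(50)

for n, observed in ((5, spread_small), (50, spread_large)):
    predicted = 1.0 / math.sqrt(n)     # population SD is 1, so SE = 1 / sqrt(n)
    print(f"n = {n}: observed {observed:.3f}, predicted {predicted:.3f}")
```

The observed spread of the sample means shrinks as the sample size grows and stays close to the predicted standard error.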

Related Concepts

Statistical Inference

Standard Deviation vs. Standard Error

Both are measures of variation in a data set.

Standard deviation is a measure of variation of individual observations from the mean in a set of data.

Standard error of the mean measures the standard deviation of the sample means.

For individual values we use the z score to tell us how far an individual value is from the mean of the sample. Any sample will have an element of random error, meaning that by chance it may not look exactly like the population from which it was drawn. Inferential statistics allows us to quantify the amount of random error.

The steps for conducting inferential statistical tests are similar for each test:

1. State the null and alternative hypotheses.
2. Determine the decision rule.
3. Conduct the appropriate test.
4. Interpret the results.

1. State the null and alternative hypotheses

Hypotheses are formulated based on proving or disproving the status quo, or what we currently regard to be true. Each time we test a new idea, we are in actuality comparing it to our old idea of what already is known. For example, if we know chloroquine to be an effective malaria drug, then when we test the effectiveness of a new drug such as sulfadoxine-pyrimethamine, we use the old drug, chloroquine, as the baseline. Thus, our expectation is that chloroquine works and there will be no difference found by using the new drug. This becomes the null hypothesis, or H0. The alternate hypothesis (HA), often referred to as the research hypothesis, then represents the possibility that there is a real difference between the new drug and the old drug. As we know, a difference can be either higher or lower, better or worse. If we are testing for any difference, we will use a two-tailed test. If we are testing to see in which direction the difference lies, we use a one-tailed test. Using the same level of significance (alpha value), a two-tailed test is more stringent than a one-tailed test.

2. Determine the decision rule

An alpha value (α) determines the level of significance at which you will conduct your test. This value is chosen by the researcher. The most common alpha value, and one which is considered an acceptable level of significance by researchers worldwide, is 0.05, or 5 percent. You will also see an alpha value of 0.10, but anything above that is generally considered to be too lenient to account for differences beyond those which are random or coincidental occurrences.

We can generally determine the results of hypothesis testing in three ways: 1) by comparing a calculated value (t_calc) to a critical value (t_crit), 2) by comparing the alpha value to a p value, and 3) by determining if the value specified in the null hypothesis is contained within the limits of a confidence interval. The calculated value is also referred to as the test statistic and is calculated through the use of descriptive statistics for the sample. A critical value is identified by using the correct table. An alpha value, as previously discussed, is specified by the researcher and will be given. The p value corresponds to the value of the computed test statistic and can be found in some tables, or determined using a statistical software package.

When the value of the computed test statistic exceeds the critical value (i.e., t_calc > t_crit), we can reject the null hypothesis. When α > p, we can also reject the null hypothesis. Lastly, if the value specified in the null hypothesis is not contained within the limits of our confidence interval, we can once again reject the null hypothesis. Note that when we are not able to reject the null, we use the phrase "fail to reject the null." We never accept the null. We only reject it or fail to reject it. By rejecting the null, we conclude that the data support our alternative hypothesis. This is evidence in its favor, not proof that it is true.

3. Conduct the appropriate test

There are several different test statistics that you must choose from when testing for statistical significance. The test statistic you will use depends on the known parameters of the variable. If a population standard deviation (σ) is known, then we use the z test. With the exception of tests of proportion or very small populations, we will generally know only the standard deviation of a sample (s), in which case we use the t test. Therefore, when talking about statistical tests in general, we are referring to the t distribution. The t distribution looks very similar to the normal z distribution, but the tails on either side of the curve are longer.

Let us now revisit the general formula for the construction of a test statistic:

$$\text{test statistic} = \frac{\text{sample statistic} - \text{hypothesized population parameter}}{\text{standard error of the relevant sample statistic}}$$

For continuous data analyzed using the two-sample t test, the numerator compares the difference between the two sample means, $(\bar{x}_1 - \bar{x}_2)$, referred to as the sample statistic or point estimate here, with the difference that would be expected under a true null hypothesis (i.e., $H_0: \mu_1 - \mu_2 = 0$), referred to as the hypothesized population parameter, which often equals zero. The denominator is made up by the standard error, which serves as our measure of variability.

4. Interpret the results

The distribution tables that you will need in order to interpret results when conducting tests by hand are included at the end of this workbook. They include the Student's t table, the standard normal z distribution, and the chi-square distribution tables. Tables needed to complete the exercises presented in this workbook are included in Appendix 2.

Confidence Interval Around a Mean

The sample mean ($\bar{x}$) estimates the population mean (μ) but supplies no information on the variability or our confidence in the estimate. For this reason, we use confidence intervals.

The interval estimate makes use of the Central Limit Theorem and the z score. We first determine how confident we want to be in our estimate. The most common level of confidence is 95%. As we learned with the Empirical Rule, a feature of the normal curve is that 95% of the values will be within two standard deviations of the mean. This value of 2 is rounded up from the more precise value of 1.96. Thus the probability (P) that z falls between -1.96 and +1.96 is 0.95, or 95%.

$$P(-1.96 \le z \le +1.96) = 0.95$$

If we substitute our formula, $(\bar{x} - \mu)/(\sigma/\sqrt{n})$, for z, we get

$$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le +1.96\right) = 0.95$$

After some algebra, we end up with the formula for the 95% confidence interval around the mean as:

$$P\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

In other words, the probability that the population mean lies within plus or minus 1.96 standard errors of our sample mean is 95%. The multiplier 1.96 was chosen from the standard z table for an alpha of 0.05. If, for example, we wanted to calculate a 99% confidence interval, we would use the z score that corresponds with an alpha of 0.01. (Note that it is the standard error of the mean that we are multiplying by the z score.)

Thus, the 95% confidence interval is:

$$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

Overview

The confidence interval of the mean gives the range of plausible values for the true population mean.

95% of the time, the population mean will be within approximately two standard errors of the sample mean.

Formula:

$$95\%\ \text{CI} = \left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

Step by Step Example: Confidence Interval Around a Mean

You want to determine the mean blood pressure among government employees. In order to do this, you measure the blood pressure of 200 employees. Use the descriptive statistics below to determine a 95% confidence interval around the mean.

n = 200
$\bar{x}$ = 127 mmHg
s = 13

Step 1. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}} = \frac{13}{\sqrt{200}} = 0.92$$

Step 2. Find the lower limit of the 95% confidence interval.

95% LL = $\bar{x}$ - 1.96(SE)

95% LL = 127 - 1.96(0.92) = 127 - 1.80 = 125.2

Step 3. Find the upper limit of the 95% confidence interval.

95% UL = $\bar{x}$ + 1.96(SE)

95% UL = 127 + 1.96(0.92) = 127 + 1.80 = 128.8

Step 4. Interpret the 95% confidence interval.

The 95% confidence interval is (125.2, 128.8). This means that with repeated random sampling, 95% of the intervals calculated in this way will contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and the true mean of the population (μ) is between 125.2 and 128.8.
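The same interval can be computed in a few lines of Python. This is a minimal sketch under the large-sample z assumption used in the steps above; the function name is hypothetical, and the multiplier is taken from `statistics.NormalDist` rather than typed in as 1.96 so that other confidence levels can be tried.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """z-based confidence interval around a mean from summary statistics."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when confidence is 0.95
    standard_error = sample_sd / math.sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


lower, upper = mean_confidence_interval(127, 13, 200)
print(round(lower, 1), round(upper, 1))
```

For the blood pressure data this gives limits of about 125.2 and 128.8. Calling the function with `confidence=0.99` would give the wider 99% interval.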



Practice: Confidence Interval Around a Mean

You record gestational age at birth for live births in the past month at three primary health facilities in the region. Calculate a 95% confidence interval around the mean.

n = 350
$\bar{x}$ = 37.5 weeks
s = 12.2

Step 1. Calculate the standard error of the mean.

$$SE = \frac{s}{\sqrt{n}}$$

Step 2. Find the lower limit of the 95% confidence interval.

95% LL = $\bar{x}$ - 1.96(SE)

Step 3. Find the upper limit of the 95% confidence interval.

95% UL = $\bar{x}$ + 1.96(SE)

Step 4. Interpret the 95% confidence interval.



OpenEpi Example: Confidence Interval Around a Mean

Using the same blood pressure data as before, use OpenEpi to calculate a 95% confidence interval around the mean.

n = 200
$\bar{x}$ = 127 mmHg
s = 13

Step 1. Open the OpenEpi application.

From the OpenEpi menu choose Mean CI under the heading Continuous Variables.

Step 2. Enter the descriptive statistics as prompted.

Click on Enter New Data.

The screen shown above will open up.

Use the given information to fill in the boxes.

Notice that you only need to provide either the standard deviation, the standard error, or the variance. You do not need to provide all three. Since the standard deviation is given, this is the statistic that we will use.

Because our population is large and unknown, we can use the default number, 999999999, to represent the population size. If you have a known population, specify that number here.

Step 3. Calculate the 95% confidence interval.

Click on the button labeled Calculate.

A pop-up will open displaying the results of the calculation. Note that you must set your browser to allow pop-ups in order to view the results.

Choose the 95% confidence interval corresponding with the t test, since we do not know the variance of the population, only the standard deviation of the sample.

The 95% confidence interval is (125.2, 128.8).

With repeated random sampling, 95% of the intervals calculated in this way will contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and the true mean of the population (μ) is between 125.2 and 128.8.



Excel Example: Confidence Interval Around a Mean

We can find a confidence interval around a mean using descriptive statistics in Excel as well. Use the same blood pressure data that we used in the previous example.

Step 1. Select the confidence interval function in Excel.

In a blank worksheet, choose Insert from the toolbar. From the drop-down menu, select Function.

Type "confidence interval" in the box labeled Search for a function. The function for confidence intervals, CONFIDENCE, will appear as your only option. Alternatively, you can scroll down the list of functions until you find the one labeled CONFIDENCE.

Click on OK.

Step 2. Enter the descriptive statistics.

You will be prompted to enter the alpha, standard deviation, and sample size. Since we are calculating a 95% confidence interval, α = 1.00 - 0.95 and is, therefore, 0.05.

Click on OK.

The result will then be displayed on the worksheet in the cell marked by your cursor.

The result is the equivalent of z(SE), the amount to subtract from and add to the mean.

Step 3. Calculate the 95% confidence interval.

Therefore, we can calculate the 95% confidence interval by subtracting 1.80 from and adding 1.80 to our sample mean of 127.

95% LL = 127 - 1.80 = 125.2

95% UL = 127 + 1.80 = 128.8

Step 4. Interpret your results.

The 95% confidence interval is (125.2, 128.8). This means that with repeated random sampling, 95% of the intervals calculated in this way will contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and the true mean of the population (μ) is between 125.2 and 128.8.

You can also use Excel to find the confidence interval around the mean if you are given a data set instead of descriptive statistics.

Excel Example: Confidence Interval Around a Mean

For this example, we will use the data set Sit/Lie. Calculate a 95% confidence interval around the mean for the variable Sitting.

Step 1. Import the data set into Excel.

Import the data set, twosamplet, by using the directions in the box below.

To open a data set in Excel:

Choose the heading Data from the toolbar.
Click on Import External Data.
Click on Import Data.
Open the folder where you have stored the database.
Choose the table that you will be working from.
Click OK.
Choose where you would like to put the data by selecting a cell of the current worksheet or selecting a new worksheet.
Click OK.

Step 2. Calculate the 95% confidence interval using Excel.

Choose Tools from the toolbar. Select Data Analysis from the drop-down box. Highlight Descriptive Statistics and click OK. You will see a box like the one below:

Click on the chart icon next to the text box marked Input Range.

Highlight the column for the variable Sitting by clicking on the letter which corresponds with the column.

Click on the chart icon in the box labeled Descriptive Statistics to return to the dialogue box.

Check the box next to Labels in First Row.

Next, choose your output options. A new worksheet is chosen as the default, but if you would like your output to appear on the same worksheet as your data set, select the first option under Output options, Output Range. Click on the icon next to the text box. Choose the area where you would like your output to appear by clicking on a cell. Click on the icon again to return to the dialogue box.

Check the boxes next to Summary statistics and Confidence level for Mean.

Click OK.

Step 3. Use the output to calculate the confidence interval.

Your output will look like this:

Notice that the output does not actually provide you with a confidence interval. Instead, you are given a number which represents the difference from the mean. To find the confidence interval around the mean, subtract this number from and add this number to the mean.

95% CI = $\bar{x}$ ± confidence level = (80.95 - 14.13, 80.95 + 14.13) = (66.82, 95.08)

The 95% confidence interval around the mean is (66.82, 95.08). With repeated random sampling, 95% of the intervals calculated in this way will contain the true population mean. We are, therefore, 95% confident that this is one of those intervals and the true mean of the population (μ) is between 66.82 and 95.08.

Excel or OpenEpi Practice: Confidence Interval Around a Mean

Using the data from the HIV Knowledge pretest, calculate the 95% confidence interval around the mean score for females in either Excel or OpenEpi.

Pretest Scores: HIV Knowledge

        Females   Males
Mean    60        40
SD      12        10
N       138       97

For additional practice, calculate the 95% confidence interval around the mean score for males by using the computer application that you did not previously use.

Step 1. Open the appropriate application.

Step 2. Enter the descriptive statistics.

Step 3. Calculate the 95% confidence interval.

Step 4. Interpret your results.

Related Concepts

Confidence Interval Around a Proportion
Confidence Interval: Two Sample t Test

ConfidenceIntervalAroundaProportion



TheCentralLimitTheoremalsoapplieswhenconsideringadistributionofsampleproportions,whenthesamplesizeislargeenough.Thesamplingdistributionwouldbeconstructedsimilarlyasforthemean.Howeverthecharacteristicsofthesamplingdistributionwillbedifferentasthisisabinomialdistribution.Wewillbeestimatingthepopulationproportionratherthanthepopulationmean.Sincethebinomialdistributionisasamplingdistributionforp,itsmeanequalsthepopulationmeananditsstandarddeviationrepresentsthestandarderror(SE).

n=samplesizeornumberoftrials p=probabilityofsuccess 1p=probabilityoffailure

SEoftheproportion=n

)p1(p

As the sample size, n, increases, the binomial distribution becomes very close to a normal distribution due to the Central Limit Theorem.

Therefore, the normal distribution can be used to calculate confidence intervals and do hypothesis tests.

If np and n(1 − p) are both equal to 10 or more, then the normal approximation may be used.
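The rule of thumb above is easy to check in code. The following is a minimal sketch; the function name `normal_approximation_ok` and the example inputs are hypothetical and chosen only to show one case that passes and one that fails.

```python
def normal_approximation_ok(n, p, threshold=10):
    """Return True when both np and n(1 - p) reach the threshold (10 in the text above)."""
    return n * p >= threshold and n * (1 - p) >= threshold


# Hypothetical inputs: a large sample with a moderate proportion passes,
# a small sample with a rare outcome does not.
print(normal_approximation_ok(200, 0.25))  # True  (np = 50, n(1 - p) = 150)
print(normal_approximation_ok(20, 0.10))   # False (np = 2)
```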

Similar to the method used to calculate a confidence interval around a mean, to calculate the 95% confidence interval around a proportion, we first calculate the standard error of the proportion and then use the same formula:

95% CI = $p \pm 1.96\sqrt{\dfrac{p(1-p)}{n}}$

Overview

The confidence interval around a proportion gives the range of plausible values for the true population proportion.

95% of the time, the population proportion will be within approximately two standard errors of the sample proportion.

Formula:

95% CI = $\left(p - 1.96\sqrt{\dfrac{p(1-p)}{n}},\ \ p + 1.96\sqrt{\dfrac{p(1-p)}{n}}\right)$



Step by Step Example: Confidence Interval Around a Proportion

Out of 212 pregnant women tested for HIV, 53 had positive results. Use this information to find a 95% confidence interval for the population.

Step | Example

1. Identify p and 1 − p.

p, the proportion of successes = 53/212 = 0.25

1 − p, the proportion of failures = 1 − 0.25 = 0.75

2. Calculate the 95% lower limit.

95% LL = $p - 1.96\sqrt{\dfrac{p(1-p)}{n}}$

95% LL = $0.25 - 1.96\sqrt{\dfrac{0.25(0.75)}{212}}$

= $0.25 - 1.96\sqrt{\dfrac{0.1875}{212}}$

= $0.25 - 1.96\sqrt{0.00088}$ = 0.25 − (1.96 × 0.0297) = 0.25 − 0.0583 = 0.1917

3. Calculate the 95% upper limit.

95% UL = $p + 1.96\sqrt{\dfrac{p(1-p)}{n}}$

95% UL = $0.25 + 1.96\sqrt{\dfrac{0.25(0.75)}{212}}$

= 0.25 + 0.0583 = 0.3083

4. Interpret the interval.

The 95% confidence interval is (0.19, 0.31). With repeated random sampling, 95% of intervals calculated will contain the true proportion of the population. We are 95% confident that this is one of those intervals and the prevalence of HIV in the population is between 19% and 31%.

Note: You will see (1 − p) referred to as q later in this workbook, as well as in many biostatistics texts.
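The hand calculation in the example above can be reproduced with a short Python sketch. The function name `wald_ci` is our own label for the normal approximation interval used in this section; the inputs (53 positives out of 212) come from the example.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Normal approximation interval: p +/- z * sqrt(p(1 - p) / n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    margin = z * se
    return p - margin, p + margin


lower, upper = wald_ci(53, 212)
print(f"95% CI = ({lower:.4f}, {upper:.4f})")  # 95% CI = (0.1917, 0.3083)
```

Rounded to two decimal places, this gives the same interval as the example, (0.19, 0.31).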



Practice: Confidence Interval Around a Proportion

Upon testing 250 confirmed AIDS cases, you find that 116 are positive for tuberculosis. Find the 95% confidence interval around the proportion of AIDS patients infected with TB.

Step | Practice Space

1. Identify p and 1 − p.

2. Calculate the 95% lower limit.

95% LL = $p - 1.96\sqrt{\dfrac{p(1-p)}{n}}$

3. Calculate the 95% upper limit.

95% UL = $p + 1.96\sqrt{\dfrac{p(1-p)}{n}}$

4. Interpret the interval.



OpenEpi Example: Confidence Interval Around a Proportion

Using the previous example, we will demonstrate how to calculate a 95% confidence interval around a proportion. Out of 212 pregnant women tested for HIV, 53 had positive results. Use this information to find a 95% confidence interval for the population in OpenEpi.

Step | Example

1. Open the OpenEpi application.

From the OpenEpi menu, choose Proportion under the heading Counts.

2. Enter the proportion data as prompted.

Click on Enter New Data.

A data entry screen will open.

Use the given information to fill in the boxes. The numerator is always the number of successes (here, the 53 positive results). The denominator is the size of the population or sample (here, the 212 women tested).

3. Calculate the 95% confidence interval.

Click on the button labeled Calculate.

A pop-up will open displaying the results of the calculation. Note that you must set your browser to allow pop-ups in order to view the results.



Step | Example

4. Interpret the results.

OpenEpi calculates the 95% confidence interval by using several different methods. Although the OpenEpi editors recommend looking at the Mid-P Exact interval first, it is the Wald (Normal Approximation) interval that corresponds most closely with our hand calculations.

The 95% confidence interval is (0.19, 0.31). With repeated random sampling, 95% of intervals calculated will contain the true proportion of the population. We are 95% confident that this is one of those intervals and the prevalence of HIV in the population is between 19% and 31%.



OpenEpi Practice: Confidence Interval Around a Proportion

There has been a meningitis outbreak. You find that in one school, three students out of an enrolled 400 have been infected with meningitis. Use OpenEpi to calculate a 95% confidence interval.

Step | Practice Space

1. Open the OpenEpi application.

2. Enter the proportion data as prompted.

3. Calculate the 95% confidence interval.

4. Interpret the results.

Related Concepts

Confidence Interval: z test of Proportions

Hypothesis Testing: Two Sample t test



Used for continuous data, the t test is one of the most commonly used statistical tests performed in the public health and clinical literature. Hypothesis testing using the t test allows us to determine whether the observed difference between the mean values of two groups is statistically significant.

Overview

Test employed to evaluate the null hypothesis ($H_0$) that the population means are equal versus the alternative hypothesis ($H_a$) that the population means are different. This test is used to compare the means of two independent samples.

Example: Comparing the difference in mean blood pressure for a sample of refugees to that of a sample of host country residents.

Formula:

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$

Where:

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

and is referred to as the pooled variance.

Assumptions:
o Two independent random samples
o Normally distributed population
o Equal, but unknown, variances in the two samples (Note: There is a method to compare two samples with unequal variances called Satterthwaite's method. Please refer to a biostatistics text for further explanation.)

Type of variables: Continuous

Decision rule: If the calculated value of t ($t_{calc}$) is greater than the critical value of t ($t_{crit}$), then we can reject the null hypothesis.

Table used: Student's t table

A vital component used in the calculation of the standard error for the two sample t test is the pooled variance, denoted $s_p^2$. As indicated above, a major assumption necessary for the validity of the two sample t test is that the variances are unknown, but assumed to be equal. We can check this assumption by dividing the variance of one sample by the variance of the second sample ($s_1^2 / s_2^2$), placing the larger variance in the numerator. If $s_1^2 / s_2^2$ is less than three, assume that the variances are approximately equal. The closer this value is to one, the more nearly equal the variances are. When this assumption is justified, a pooled estimate of the common variance ($s_p^2$) can be calculated, which estimates the overall variance of the entire study population.
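The variance ratio check can be sketched as follows. The function names and the standard deviations used here are hypothetical. The sketch also assumes the larger variance is placed in the numerator, so that the ratio is never below one and the "less than three" cutoff is meaningful.

```python
def variance_ratio(s1, s2):
    """Ratio of the larger sample variance to the smaller one (inputs are standard deviations)."""
    v1, v2 = s1 ** 2, s2 ** 2
    return max(v1, v2) / min(v1, v2)


def variances_roughly_equal(s1, s2, cutoff=3.0):
    """Apply the rule of thumb: a ratio below the cutoff is treated as approximately equal."""
    return variance_ratio(s1, s2) < cutoff


# Hypothetical standard deviations for two samples.
print(variance_ratio(15, 10), variances_roughly_equal(15, 10))  # 2.25 True
print(variance_ratio(10, 20), variances_roughly_equal(10, 20))  # 4.0 False
```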

The pooled estimate is obtained by computing the weighted average of the two sample variances. The sample variances ($s_1^2$ and $s_2^2$) are weighted according to the number of observations in each. If the sample sizes are equal ($n_1 = n_2$), this weighted average is the mean of the two sample variances. If the two groups are of unequal size ($n_1 \ne n_2$), the pooled variance is calculated as follows:

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

Our test statistic follows the Student's t distribution with $n_1 + n_2 - 2$ degrees of freedom.
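A brief sketch of the pooled variance may help show how the weighting behaves. The function name and all of the numbers below are hypothetical and are not taken from the workbook's examples.

```python
def pooled_variance(n1, s1, n2, s2):
    """Weighted average of two sample variances; inputs s1 and s2 are standard deviations."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)


# Equal sample sizes: the result is the plain mean of the two variances (4 and 16).
equal_n = pooled_variance(10, 2.0, 10, 4.0)

# Unequal sample sizes: the larger sample (variance 4) pulls the result toward its variance.
unequal_n = pooled_variance(30, 2.0, 10, 4.0)

degrees_of_freedom = 30 + 10 - 2

print(equal_n)              # 10.0
print(round(unequal_n, 3))  # 6.842
print(degrees_of_freedom)   # 38
```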

Step by Step Example: Hypothesis Testing Two Sample t test

Can we conclude that infants born at a low income area clinic, on the average, tend to be lighter than those born at a clinic serving a high income population area? Within the past month, a student has collected data on birth weights (grams) from a random sample of 80 deliveries at a clinic serving a high income population (High) and 100 deliveries at a clinic serving a low income population (Low). The relevant information is summarized in the table below. Let alpha equal 0.05.

Clinic            n     $\bar{x}$   s
High Clinic (1)   80    2800        100
Low Clinic (2)    100   2650        82



Step | Example

1. State the null and alternative hypotheses.

The researcher will determine if the mean value for one group is lower than that of the other, so a one sided test of our hypotheses is indicated.

Our null hypothesis states that the mean birth weight of babies born at the high income clinic (1) is less than or equal to that of babies who are born at the low income clinic (2). The null hypothesis is written as:

$H_0: \mu_1 \le \mu_2$

The alternative hypothesis states that the mean birth weight of babies born at the high income clinic (1) is greater than that of those born at the low income clinic (2), and is written as:

$H_a: \mu_1 > \mu_2$

Another way of stating the hypotheses is below. Here you are stating that the difference between the two population means (D) is less than or equal to zero (null) or the difference is greater than zero (alternative).

$H_0: \mu_1 - \mu_2 \le 0$
$H_a: \mu_1 - \mu_2 > 0$

2. State the decision rule.

Using a one sided test with an alpha value of 0.05 and $n_1 + n_2 - 2 = 178$ df, the critical value of the test statistic is 1.645. We obtain this value from the Student's t table. Note that 178 degrees of freedom is not on the table, so we approximate it by using infinity ($\infty$).

Thus, we should reject $H_0$ if $t_{calc} > 1.645$.



Step | Example

3. Calculate the value of the test statistic.

Computing the value of the test statistic involves several steps. The formula we will follow is

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$

a. Calculate the difference in sample means.

$\bar{x}_d = (\bar{x}_1 - \bar{x}_2)$

Begin by computing the difference in sample means:

$(\mu_1 - \mu_2)$ is set to 0, the boundary value of the null hypothesis (no difference between the two population means).

$(\bar{x}_1 - \bar{x}_2)$ is computed as: 2800 − 2650 = 150

b. Compute the value of the pooled variance.

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

The pooled variance is calculated as:

$s_p^2 = \dfrac{79(100)^2 + 99(82)^2}{178} = 8177.955$

c. Find the value for the standard error.

$SE = \sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

This will be the denominator of the $t_{calc}$ equation. Using the pooled variance calculated above, the standard error is computed as:

$\sqrt{\dfrac{8177.955}{80} + \dfrac{8177.955}{100}} = 13.56$

d. Determine the value of $t_{calc}$.

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$

Specifically, we are taking our calculations from parts a and c and substituting those into our formula.

$t_{calc} = \dfrac{150 - 0}{13.56} = 11.06$



Step | Example

4. State the statistical decision.

We reject $H_0$ since the value of our test statistic, $t_{calc}$ = 11.06, exceeds the critical t value of 1.645. Our test statistic therefore falls in the rejection region.

5. Report the p value.

For this test, a p value
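The calculation steps of the worked example can be collected into one short Python sketch. The function name `pooled_t_test` is our own. The critical value is taken from the standard normal distribution, which mirrors the example's use of the infinity row of the Student's t table rather than the exact value for 178 degrees of freedom.

```python
import math
from statistics import NormalDist


def pooled_t_test(n1, mean1, s1, n2, mean2, s2, hypothesized_diff=0.0):
    """Return (pooled variance, standard error, t) from summary statistics."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var / n1 + pooled_var / n2)
    t_calc = ((mean1 - mean2) - hypothesized_diff) / se
    return pooled_var, se, t_calc


# Summary statistics from the example: High Clinic (1) and Low Clinic (2).
pooled_var, se, t_calc = pooled_t_test(80, 2800, 100, 100, 2650, 82)

# One sided test, alpha = 0.05, using the normal (infinite df) approximation.
t_crit = NormalDist().inv_cdf(0.95)

print(round(pooled_var, 3), round(se, 2), round(t_calc, 2))  # 8177.955 13.56 11.06
print(round(t_crit, 3), t_calc > t_crit)                     # 1.645 True
```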



Step | Practice Space

3. Calculate the value of the test statistic.

a. Calculate the difference in sample means.

$\bar{x}_d = (\bar{x}_1 - \bar{x}_2)$

b. Compute the value of the pooled variance.

$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

c. Find the value for the standard error.

$SE = \sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

d. Determine the value of $t_{calc}$.

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$

4. State the statistical decision.

5. Report the p value.

6. State the practical conclusion.

Epi Info Example: Hypothesis Testing Two Sample t test

We will use the example on page 86 to conduct a two sample t test in Epi Info. We are determining whether infants born at a low income area clinic tend to have a lower birth weight than those born at a clinic serving an area with a high income population. For this statistical test, we will use a one tailed analysis, since we want to know specifically whether babies born at the clinic serving a low income population area, on the average, tend to be lighter than those born at the clinic serving a high income population area, and not only whether the birth weights differ. Assume an $\alpha$ of 0.05.

Step | Example

1. State the null and alternative hypotheses.

$H_0: \mu_1 \le \mu_2$ or $\mu_1 - \mu_2 \le 0$ (Babies born in the high income area clinic weigh less than or equal to those born in a clinic serving a low income area.)

$H_a: \mu_1 > \mu_2$ or $\mu_1 - \mu_2 > 0$ (Babies born in the high income area clinic weigh more than those babies born in a clinic serving a low income area.)

2. State the decision rule.

We will choose an alpha value of 0.05 in order to compare our results from the computer program to those which we previously calculated by hand.

If $\alpha$ > p, we can reject the null hypothesis.

In addition, if we know the critical t value, then if $t_{calc} > t_{crit}$, we can reject the null hypothesis.



Step | Example

3. Execute the two sample t test.

a. READ the database file.

Open Epi Info and choose Analyze Data.

Choose the table two_sample_t from the dataset Bios_Workbook_Examples.

b. Select the MEANS command.

Use the arrow under Means of to scroll through the variables. Choose Birthweight.

Scroll down under Cross-tabulate by Value of and choose Clinic.

Click on OK.

Scroll down to find the descriptive statistics.

4. Report the p value and/or the calculated t value.

Our p value given in the output is 0.00.

We have found a t statistic of 11.05, which differs only slightly from the t statistic calculated (11.06) on page 88. This could be due to rounding errors that we made in our calculations.

Note that Epi Info uses an alpha value of 0.05 and a two tailed test as defaults.



Step | Example

5. State the statistical decision.

Since our p value of 0.00* is less than the alpha of 0.05, we have sufficient evidence to conclude that there is a significant difference between birth weights in the two clinics.

Remember that we can find our critical t value by using the Student's t table. In this case it is 1.645 (use the total observations to find N and the total degrees of freedom). Since our calculated t is 11.0545 and is greater than 1.645, we can confirm the ability to reject the null hypothesis.

6. State the practical conclusion.

Because p



Epi Info Practice: Hypothesis Testing Two Sample t test

There was an outbreak of cholera among students in a village school. You were given a record of those infected by the school director. Of the students infected with cholera, you want to determine if there is a significant difference in the age of the infected by gender. Use the t test in Epi Info to determine if there is a significant difference (alpha = 0.05) between the mean ages of males and females infected with cholera. Use the table AgeInSchool from the dataset Bios_Workbook_Examples.

Step | Practice Space

1. State the null and

3. Perform a two samp
