Computational and Inferential Thinking
Table of Contents

Introduction
1 Data Science
  1.1 Introduction
    1.1.1 Computational Tools
    1.1.2 Statistical Techniques
  1.2 Why Data Science?
    1.2.1 Example: Plotting the Classics
  1.3 Causality and Experiments
    1.3.1 Observation and Visualization: John Snow and the Broad Street Pump
    1.3.2 Snow's "Grand Experiment"
    1.3.3 Establishing Causality
    1.3.4 Randomization
    1.3.5 Endnote
  1.4 Expressions
    1.4.1 Names
    1.4.2 Example: Growth Rates
    1.4.3 Call Expressions
    1.4.4 Data Types
    1.4.5 Comparisons
  1.5 Sequences
    1.5.1 Arrays
    1.5.2 Ranges
  1.6 Tables
    1.6.1 Functions
    1.6.2 Functions and Tables
2 Visualization
  2.1 Bar Charts
  2.2 Histograms
3 Sampling
  3.1 Sampling
  3.2 Iteration
  3.3 Estimation
  3.4 Center
  3.5 Spread
  3.6 The Normal Distribution
  3.7 Exploration: Privacy
4 Prediction
  4.1 Correlation
  4.2 Regression
  4.3 Prediction
  4.4 Higher-Order Functions
  4.5 Errors
  4.6 Multiple Regression
  4.7 Classification
  4.8 Features
5 Inference
  5.1 Confidence Intervals
  5.2 Distance Between Distributions
  5.3 Hypothesis Testing
  5.4 Permutation
  5.5 A/B Testing
  5.6 Regression Inference
  5.7 Regression Diagnostics
6 Probability
  6.1 Conditional Probability
Computational and Inferential Thinking
The Foundations of Data Science

By Ani Adhikari and John DeNero
Contributions by David Wagner

This is the textbook for the Foundations of Data Science class at UC Berkeley.

View this textbook online on Gitbooks.
What is Data Science

Data Science is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference. Exploration involves identifying patterns in information. Prediction involves using information we know to make informed guesses about values we wish we knew. Inference involves quantifying our degree of certainty: will those patterns we found also appear in new observations? How accurate are our predictions? Our primary tools for exploration are visualizations, for prediction are machine learning and optimization, and for inference are statistical tests and models.

Statistics is a central component of data science because statistics studies how to make robust conclusions with incomplete information. Computing is a central component because programming allows us to apply analysis techniques to the large and diverse data sets that arise in real-world applications: not just numbers, but text, images, videos, and sensor readings. Data science is all of these things, but it is more than the sum of its parts because of the applications. Through understanding a particular domain, data scientists learn to ask appropriate questions about their data and correctly interpret the answers provided by our inferential and computational tools.
Chapter 1: Introduction

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.

For whatever aspect of the world we wish to study---whether it's the Earth's weather, the world's markets, political polls, or the human mind---data we collect typically offer an incomplete description of the subject at hand. The central challenge of data science is to make reliable conclusions using this partial information.

In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.

Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them---a programming language.
Computational Tools

This text uses the Python 3 programming language, along with a standard set of numerical and data visualization tools that are used widely in commercial applications, scientific experiments, and open-source projects. Python has recruited enthusiasts from many professions that use data to draw conclusions. By learning the Python language, you will join a million-person-strong community of software developers and data scientists.

Getting Started. The easiest and recommended way to start writing programs in Python is to log into the companion site for this text, data8.berkeley.edu. If you are enrolled in a course offering that uses this text, you should have full access to the programming environment hosted on that site. The first time you log in, you will find a brief tutorial describing how to proceed.

You are not at all restricted to using this web-based programming environment. A Python program can be executed by any computer, regardless of its manufacturer or operating system, provided that support for the language is installed. If you wish to install the version of Python and its accompanying libraries that will match this text, we recommend the Anaconda distribution that packages together the Python 3 language interpreter, IPython libraries, and the Jupyter notebook environment.

This text includes a complete introduction to all of these computational tools. You will learn to write programs, generate images from data, and work with real-world data sets that are published online.
Statistical Techniques

The discipline of statistics has long addressed the same fundamental challenge as data science: how to draw conclusions about the world using incomplete information. One of the most important contributions of statistics is a consistent and precise vocabulary for describing the relationship between observations and conclusions. This text continues in the same tradition, focusing on a set of core inferential problems from statistics: testing hypotheses, estimating confidence, and predicting unknown quantities.

Data science extends the field of statistics by taking full advantage of computing, data visualization, and access to information. The combination of fast computers and the Internet gives anyone the ability to access and analyze vast data sets: millions of news articles, full encyclopedias, databases for any domain, and massive repositories of music, photos, and video.

Applications to real data sets motivate the statistical techniques that we describe throughout the text. Real data often do not follow regular patterns or match standard equations. The interesting variation in real data can be lost by focusing too much attention on simplistic summaries such as average values. Computers enable a family of methods based on resampling that apply to a wide range of different inference problems, take into account all available information, and require few assumptions or conditions. Although these techniques have often been reserved for graduate courses in statistics, their flexibility and simplicity are a natural fit for data science applications.
Why Data Science?

Most important decisions are made with only partial information and uncertain outcomes. However, the degree of uncertainty for many decisions can be reduced sharply by public access to large data sets and the computational tools required to analyze them effectively. Data-driven decision making has already transformed a tremendous breadth of industries, including finance, advertising, manufacturing, and real estate. At the same time, a wide range of academic disciplines are evolving rapidly to incorporate large-scale data analysis into their theory and practice.

Studying data science enables individuals to bring these techniques to bear on their work, their scientific endeavors, and their personal decisions. Critical thinking has long been a hallmark of a rigorous education, but critiques are often most effective when supported by data. A critical analysis of any aspect of the world, be it business or social science, involves inductive reasoning---conclusions can rarely be proven outright, only supported by the available evidence. Data science provides the means to make precise, reliable, and quantitative arguments about any set of observations. With unprecedented access to information and computing, critical thinking about any aspect of the world that can be measured would be incomplete without the inferential techniques that are core to data science.

The world has too many unanswered questions and difficult challenges to leave this critical reasoning to only a few specialists. However, all educated adults can build the capacity to reason about data. The tools, techniques, and data sets are all readily available; this text aims to make them accessible to everyone.
Example: Plotting the Classics

In this example, we will explore statistics for two classic novels: The Adventures of Huckleberry Finn by Mark Twain, and Little Women by Louisa May Alcott. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.

The features of Python used in this example will be explained in detail later in the course. This example is meant to illustrate some of the broad themes of this text. Don't worry if the details of the program don't yet make sense. Instead, focus on interpreting the images generated below. The "Expressions" section later in this chapter will describe most of the features of the Python programming language used below.

First, we read the text of both books into lists of chapters, called huck_finn_chapters and little_women_chapters. In Python, a name cannot contain any spaces, and so we will often use an underscore _ to stand in for a space. The = in the lines below gives a name on the left to the result of some computation described on the right. A uniform resource locator or URL is an address on the Internet for some content; in this case, the text of a book.
# Read two books, fast!
# (read_url is a helper provided by the text's notebook environment.)

huck_finn_url = 'http://www.gutenberg.org/cache/epub/76/pg76.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'http://www.gutenberg.org/cache/epub/514/pg514.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
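As a side note on how the splitting above works: split breaks a string at every occurrence of the separator, and the slice drops everything before the first real chapter (likely why the Huckleberry Finn slice starts at 44, skipping the front matter and the book's own table of contents). A minimal sketch on a made-up string:

```python
# toy_text stands in for a full Project Gutenberg file, which has front
# matter before the first chapter.
toy_text = "Front matter ... CHAPTER I. First words. CHAPTER II. More words."

# split breaks the string at every 'CHAPTER'; element 0 is the front
# matter, so the slice [1:] keeps only the chapters themselves.
toy_chapters = toy_text.split('CHAPTER')[1:]
```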
While a computer cannot understand the text of a book, we can still use it to provide us with some insight into the structure of the text. The name huck_finn_chapters is currently bound to a list of all the chapters in the book. We can place those chapters into a table to see how each begins.
Table([huck_finn_chapters], ['Chapters'])

Chapters
I. YOU don't know about me without you have read a book ...
II. WE went tiptoeing along a path amongst the trees bac ...
III. WELL, I got a good going-over in the morning from o ...
IV. WELL, three or four months run along, and it was wel ...
V. I had shut the door to. Then I turned around and ther ...
VI. WELL, pretty soon the old man was up and around agai ...
VII. "GIT up! What you 'bout?" I opened my eyes and look ...
VIII. THE sun was up so high when I waked that I judged ...
IX. I wanted to go and look at a place right about the m ...
X. AFTER breakfast I wanted to talk about the dead man a ...
... (33 rows omitted)
Each chapter begins with a chapter number in Roman numerals, followed by the first sentence of the chapter. Project Gutenberg has printed the first word of each chapter in upper case.

The book describes a journey that Huck and Jim take along the Mississippi River. Tom Sawyer joins them towards the end as the action heats up. We can quickly visualize how many times these characters have each been mentioned at any point in the book.
counts = Table([np.char.count(huck_finn_chapters, "Jim"),
                np.char.count(huck_finn_chapters, "Tom"),
                np.char.count(huck_finn_chapters, "Huck")],
               ["Jim", "Tom", "Huck"])
counts.cumsum().plot(overlay=True)
In the plot above, the horizontal axis shows chapter numbers and the vertical axis shows how many times each character has been mentioned so far. You can see that Jim is a central character by the large number of times his name appears. Notice how Tom is hardly mentioned for much of the book until after Chapter 30. His curve and Jim's rise sharply at that point, as the action involving both of them intensifies. As for Huck, his name hardly appears at all, because he is the narrator.

Little Women is a story of four sisters growing up together during the Civil War. In this book, chapter numbers are spelled out and chapter titles are written in all capital letters.
Table([little_women_chapters], ['Chapters'])
Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho ...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr ...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ...
FOUR BURDENS "Oh, dear, how hard it does seem to take up ...
FIVE BEING NEIGHBORLY "What in the world are you going t ...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect ...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as ...
NINE MEG GOES TO VANITY FAIR "I do think it was the most ...
TEN THE P.C. AND P.O. As spring came on, a new set of am ...
... (37 rows omitted)
We can track the mentions of main characters to learn about the plot of this book as well. The protagonist Jo interacts with her sisters Meg, Beth, and Amy regularly, up until Chapter 27 when she moves to New York alone.
people = ["Meg", "Jo", "Beth", "Amy", "Laurie"]
people_counts = Table([np.char.count(little_women_chapters, pp) for pp in people],
                      people)
people_counts.cumsum().plot(overlay=True)
Laurie is a young man who marries one of the girls in the end. See if you can use the plots to guess which one.

In addition to visualizing data to find patterns, data science is often concerned with whether two similar patterns are really the same or different. In the context of novels, the word "character" has a second meaning: a printed symbol such as a letter or number. Below, we compare the counts of this type of character in the first chapter of Little Women.
from collections import Counter

chapter_one = little_women_chapters[0]
counts = Counter(chapter_one.lower())
letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 1 Count'])
letters.sort('Letters').show(20)
Letters  Chapter 1 Count
         4102
!        28
"        190
'        125
,        423
-        21
.        189
;        4
?        16
_        6
a        1352
b        290
c        366
d        788
e        1981
f        339
g        446
h        1069
i        1071
j        51
... (16 rows omitted)
How do these counts compare with the character counts in Chapter 2? By plotting both sets of counts together, we can see slight variations for every character. Is that difference meaningful? How sure can we be? Such questions will be addressed precisely in this text.
counts = Counter(little_women_chapters[1].lower())
two_letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 2 Count'])
compare = letters.join('Letters', two_letters)
compare.barh('Letters', overlay=True)
In some cases, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.

As an example, let us first use the computer to get us some information that would be really tedious to acquire by hand: the number of characters and the number of periods in each chapter of both books.
chlen_per_hf = Table([[len(s) for s in huck_finn_chapters],
                      np.char.count(huck_finn_chapters, '.')],
                     ["HF Chapter Length", "Number of periods"])

chlen_per_lw = Table([[len(s) for s in little_women_chapters],
                      np.char.count(little_women_chapters, '.')],
                     ["LW Chapter Length", "Number of periods"])
Here are the data for Huckleberry Finn. Each row of the table below corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general; the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.

chlen_per_hf
HF Chapter Length  Number of periods
7026               66
11982              117
8529               72
6799               84
8166               91
14550              125
13218              127
22208              249
8081               71
7036               70
... (33 rows omitted)
Here are the corresponding data for Little Women.

chlen_per_lw
LW Chapter Length  Number of periods
21759              189
22148              188
20558              231
25526              195
23395              255
14622              140
14431              131
22476              214
33767              337
18508              185
... (37 rows omitted)
You can see that the chapters of Little Women are in general longer than those of Huckleberry Finn. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two authors. One way for us to do this is to plot both sets of data on the same axes.

In the plot below, there is a dot for each chapter in each book. Blue dots correspond to Huckleberry Finn and yellow dots to Little Women. The horizontal axis represents the number of characters and the vertical axis represents the number of periods.
plots.figure(figsize=(10, 10))
# Color arguments restored to match the description above: blue for
# Huckleberry Finn, yellow for Little Women.
plots.scatter([len(s) for s in huck_finn_chapters],
              np.char.count(huck_finn_chapters, '.'), color='b')
plots.scatter([len(s) for s in little_women_chapters],
              np.char.count(little_women_chapters, '.'), color='y')
The plot shows us that many but not all of the chapters of Little Women are longer than those of Huckleberry Finn, as we had observed by just looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the yellow points. Moreover, it looks as though both colors of points might be clustered around the same straight line.

Indeed, it appears from looking at the plot that on average both books tend to have somewhere between 100 and 150 characters between periods. Perhaps these two great 19th century novels were signaling something so very familiar to us now: the 140-character limit of Twitter.
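The eyeball estimate above can also be computed directly: total characters divided by total periods. A minimal sketch, using tiny made-up chapters as stand-ins for the real huck_finn_chapters list built earlier:

```python
# Average number of characters per period across a list of chapters.
def chars_per_period(chapters):
    total_chars = sum(len(s) for s in chapters)
    total_periods = sum(s.count('.') for s in chapters)
    return total_chars / total_periods

# Stand-in chapters; with the real chapter lists from earlier, the same
# function estimates the average sentence length in each novel.
toy_chapters = ["One. Two sentences here.", "And a third one."]
ratio = chars_per_period(toy_chapters)
```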
Causality and Experiments

"These problems are, and will probably ever remain, among the inscrutable secrets of nature. They belong to a class of questions radically inaccessible to the human intelligence." —The Times of London, September 1849, on how cholera is contracted and spread

Does the death penalty have a deterrent effect? Is chocolate good for you? What causes breast cancer?

All of these questions attempt to assign a cause to an effect. A careful examination of data can help shed light on questions like these. In this section you will learn some of the fundamental concepts involved in establishing causality.
Observation is a key to good science. An observational study is one in which scientists make conclusions based on data that they have observed but had no hand in generating. In data science, many such studies involve observations on a group of individuals, a factor of interest called a treatment, and an outcome measured on each individual.

It is easiest to think of the individuals as people. In a study of whether chocolate is good for the health, the individuals would indeed be people, the treatment would be eating chocolate, and the outcome might be a measure of blood pressure. But individuals in observational studies need not be people. In a study of whether the death penalty has a deterrent effect, the individuals could be the 50 states of the union. A state law allowing the death penalty would be the treatment, and an outcome could be the state's murder rate.

The fundamental question is whether the treatment has an effect on the outcome. Any relation between the treatment and the outcome is called an association. If the treatment causes the outcome to occur, then the association is causal. Causality is at the heart of all three questions posed at the start of this section. For example, one of the questions was whether chocolate directly causes improvements in health, not just whether there is a relation between chocolate and health.

The establishment of causality often takes place in two stages. First, an association is observed. Next, a more careful analysis leads to a decision about causality.
Observation and Visualization: John Snow and the Broad Street Pump

One of the earliest examples of astute observation eventually leading to the establishment of causality dates back more than 150 years. To get your mind into the right time frame, try to imagine London in the 1850's. It was the world's wealthiest city but many of its people were desperately poor. Charles Dickens, then at the height of his fame, was writing about their plight. Disease was rife in the poorer parts of the city, and cholera was among the most feared. It was not yet known that germs cause disease; the leading theory was that "miasmas" were the main culprit. Miasmas manifested themselves as bad smells, and were thought to be invisible poisonous particles arising out of decaying matter. Parts of London did smell very bad, especially in hot weather. To protect themselves against infection, those who could afford to held sweet-smelling things to their noses.

For several years, a doctor by the name of John Snow had been following the devastating waves of cholera that hit England from time to time. The disease arrived suddenly and was almost immediately deadly: people died within a day or two of contracting it, hundreds could die in a week, and the total death toll in a single wave could reach tens of thousands. Snow was skeptical of the miasma theory. He had noticed that while entire households were wiped out by cholera, the people in neighboring houses sometimes remained completely unaffected. As they were breathing the same air—and miasmas—as their neighbors, there was no compelling association between bad smells and the incidence of cholera.

Snow had also noticed that the onset of the disease almost always involved vomiting and diarrhea. He therefore believed that infection was carried by something people ate or drank, not by the air that they breathed. His prime suspect was water contaminated by sewage.

At the end of August 1854, cholera struck in the overcrowded Soho district of London. As the deaths mounted, Snow recorded them diligently, using a method that went on to become standard in the study of how diseases spread: he drew a map. On a street map of the district, he recorded the location of each death.

Here is Snow's original map. Each black bar represents one death. The black discs mark the locations of water pumps. The map displays a striking revelation – the deaths are roughly clustered around the Broad Street pump.
Snow studied his map carefully and investigated the apparent anomalies. All of them implicated the Broad Street pump. For example:

- There were deaths in houses that were nearer the Rupert Street pump than the Broad Street pump. Though the Rupert Street pump was closer as the crow flies, it was less convenient to get to because of dead ends and the layout of the streets. The residents in those houses used the Broad Street pump instead.
- There were no deaths in two blocks just east of the pump. That was the location of the Lion Brewery, where the workers drank what they brewed. If they wanted water, the brewery had its own well.
- There were scattered deaths in houses several blocks away from the Broad Street pump. Those were children who drank from the Broad Street pump on their way to school. The pump's water was known to be cool and refreshing.
The final piece of evidence in support of Snow's theory was provided by two isolated deaths in the leafy and genteel Hampstead area, quite far from Soho. Snow was puzzled by these until he learned that the deceased were Mrs. Susannah Eley, who had once lived in Broad Street, and her niece. Mrs. Eley had water from the Broad Street pump delivered to her in Hampstead every day. She liked its taste.
Later it was discovered that a cesspit that was just a few feet away from the well of the Broad Street pump had been leaking into the well. Thus the pump's water was contaminated by sewage from the houses of cholera victims.

Snow used his map to convince local authorities to remove the handle of the Broad Street pump. Though the cholera epidemic was already on the wane when he did so, it is possible that the disabling of the pump prevented many deaths from future waves of the disease.

The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other, "Where is the handle to this pump?"

Snow's map is one of the earliest and most powerful uses of data visualization. Disease maps of various kinds are now a standard tool for tracking epidemics.
Towards Causality

Though the map gave Snow a strong indication that the cleanliness of the water supply was the key to controlling cholera, he was still a long way from a convincing scientific argument that contaminated water was causing the spread of the disease. To make a more compelling case, he had to use the method of comparison.

Scientists use comparison to identify an association between a treatment and an outcome. They compare the outcomes of a group of individuals who got the treatment (the treatment group) to the outcomes of a group who did not (the control group). For example, researchers today might compare the average murder rate in states that have the death penalty with the average murder rate in states that don't.

If the results are different, that is evidence for an association. To determine causation, however, even more care is needed.
Snow's "Grand Experiment"

Encouraged by what he had learned in Soho, Snow completed a more thorough analysis of cholera deaths. For some time, he had been gathering data on cholera deaths in an area of London that was served by two water companies. The Lambeth water company drew its water upriver from where sewage was discharged into the River Thames. Its water was relatively clean. But the Southwark and Vauxhall (S&V) company drew its water below the sewage discharge, and thus its supply was contaminated.

Snow noticed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth. "Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …"

The only difference was in the water supply, "one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from impurity."

Confident that he would be able to arrive at a clear conclusion, Snow summarized his data in the table below.
Supply Area      Number of houses   cholera deaths   deaths per 10,000 houses
S&V              40,046             1,263            315
Lambeth          26,107             98               37
Rest of London   256,423            1,422            59
The numbers pointed accusingly at S&V. The death rate from cholera in the S&V houses was almost ten times the rate in the houses supplied by Lambeth.
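The last column of the table follows directly from the counts in the first two. A minimal check of the arithmetic (the printed table rounds its entries, and the Rest of London row in Snow's table does not match this arithmetic exactly, so only the first two rows are checked here):

```python
# Cholera deaths per 10,000 houses, computed from the raw counts in
# Snow's table.
def deaths_per_10000(deaths, houses):
    return deaths / houses * 10000

sv_rate = deaths_per_10000(1263, 40046)      # about 315.4
lambeth_rate = deaths_per_10000(98, 26107)   # about 37.5
ratio = sv_rate / lambeth_rate               # about 8.4: "almost ten times"
```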
Establishing Causality

In the language developed earlier in the section, you can think of the people in the S&V houses as the treatment group, and those in the Lambeth houses as the control group. A crucial element in Snow's analysis was that the people in the two groups were comparable to each other, apart from the treatment.

In order to establish whether it was the water supply that was causing cholera, Snow had to compare two groups that were similar to each other in all but one aspect – their water supply. Only then would he be able to ascribe the differences in their outcomes to the water supply. If the two groups had been different in some other way as well, it would have been difficult to point the finger at the water supply as the source of the disease. For example, if the treatment group consisted of factory workers and the control group did not, then differences between the outcomes in the two groups could have been due to the water supply, or to factory work, or both, or to any other characteristic that made the groups different from each other. The final picture would have been much fuzzier.

Snow's brilliance lay in identifying two groups that would make his comparison clear. He had set out to establish a causal relation between contaminated water and cholera infection, and to a great extent he succeeded, even though the miasmatists ignored and even ridiculed him. Of course, Snow did not understand the detailed mechanism by which humans contract cholera. That discovery was made in 1883, when the German scientist Robert Koch isolated the Vibrio cholerae, the bacterium that enters the human small intestine and causes cholera.

In fact, the Vibrio cholerae had been identified in 1854 by Filippo Pacini in Italy, just about when Snow was analyzing his data in London. Because of the dominance of the miasmatists in Italy, Pacini's discovery languished unknown. But by the end of the 1800's, the miasma brigade was in retreat. Subsequent history has vindicated Pacini and John Snow. Snow's methods led to the development of the field of epidemiology, which is the study of the spread of diseases.
Confounding

Let us now return to more modern times, armed with an important lesson that we have learned along the way:

In an observational study, if the treatment and control groups differ in ways other than the treatment, it is difficult to make conclusions about causality.

An underlying difference between the two groups (other than the treatment) is called a confounding factor, because it might confound you (that is, mess you up) when you try to reach a conclusion.
Example: Coffee and lung cancer. Studies in the 1960's showed that coffee drinkers had higher rates of lung cancer than those who did not drink coffee. Because of this, some people identified coffee as a cause of lung cancer. But coffee does not cause lung cancer. The analysis contained a confounding factor – smoking. In those days, coffee drinkers were also likely to have been smokers, and smoking does cause lung cancer. Coffee drinking was associated with lung cancer, but it did not cause the disease.

Confounding factors are common in observational studies. Good studies take great care to reduce confounding.
Randomization

An excellent way to avoid confounding is to assign individuals to the treatment and control groups at random, and then administer the treatment to those who were assigned to the treatment group. Randomization keeps the two groups similar apart from the treatment.

If you are able to randomize individuals into the treatment and control groups, you are running a randomized controlled experiment, also known as a randomized controlled trial (RCT). Sometimes, people's responses in an experiment are influenced by their knowing which group they are in. So you might want to run a blind experiment in which individuals do not know whether they are in the treatment group or the control group. To make this work, you will have to give the control group a placebo, which is something that looks exactly like the treatment but in fact has no effect.

Randomized controlled experiments have long been a gold standard in the medical field, for example in establishing whether a new drug works. They are also becoming more commonly used in other fields such as economics.

Example: Welfare subsidies in Mexico. In Mexican villages in the 1990's, children in poor families were often not enrolled in school. One of the reasons was that the older children could go to work and thus help support the family. Santiago Levy, a minister in the Mexican Ministry of Finance, set out to investigate whether welfare programs could be used to increase school enrollment and improve health conditions. He conducted an RCT on a set of villages, selecting some of them at random to receive a new welfare program called PROGRESA. The program gave money to poor families if their children went to school regularly and the family used preventive health care. More money was given if the children were in secondary school than in primary school, to compensate for the children's lost wages, and more money was given for girls attending school than for boys. The remaining villages did not get this treatment, and formed the control group. Because of the randomization, there were no confounding factors and it was possible to establish that PROGRESA increased school enrollment. For boys, the enrollment increased from 73% in the control group to 77% in the PROGRESA group. For girls, the increase was even greater, from 67% in the control group to almost 75% in the PROGRESA group. Due to the success of this experiment, the Mexican government supported the program under the new name OPORTUNIDADES, as an investment in a healthy and well-educated population.
In some situations it might not be possible to carry out a randomized controlled experiment, even when the aim is to investigate causality. For example, suppose you want to study the effects of alcohol consumption during pregnancy, and you randomly assign some pregnant women to your "alcohol" group. You should not expect cooperation from them if you present them with a drink. In such situations you will almost invariably be conducting an observational study, not an experiment. Be alert for confounding factors.
Endnote

In the terminology that we have developed, John Snow conducted an observational study, not a randomized experiment. But he called his study a "grand experiment" because, as he wrote, "No fewer than three hundred thousand people … were divided into two groups without their choice, and in most cases, without their knowledge …"

Studies such as Snow's are sometimes called "natural experiments." However, true randomization does not simply mean that the treatment and control groups are selected "without their choice."

The method of randomization can be as simple as tossing a coin. It may also be quite a bit more complex. But every method of randomization consists of a sequence of carefully defined steps that allow chances to be specified mathematically. This has two important consequences.

1. It allows us to account – mathematically – for the possibility that randomization produces treatment and control groups that are quite different from each other.
2. It allows us to make precise mathematical statements about differences between the treatment and control groups. This in turn helps us make justifiable conclusions about whether the treatment has any effect.

In this course, you will learn how to conduct and analyze your own randomized experiments. That will involve more detail than has been presented in this section. For now, just focus on the main idea: to try to establish causality, run a randomized controlled experiment if possible. If you are conducting an observational study, you might be able to establish association but not causation. Be extremely careful about confounding factors before making conclusions about causality based on an observational study.
Terminology

observational study, treatment, outcome, association, causal association, causality, comparison, treatment group, control group, epidemiology, confounding, randomization, randomized controlled experiment, randomized controlled trial (RCT), blind, placebo
Fun facts

1. John Snow is sometimes called the father of epidemiology, but he was an anesthesiologist by profession. One of his patients was Queen Victoria, who was an early recipient of anesthetics during childbirth.

2. Florence Nightingale, the originator of modern nursing practices and famous for her work in the Crimean War, was a die-hard miasmatist. She had no time for theories about contagion and germs, and was not one for mincing her words. "There is no end to the absurdities connected with this doctrine," she said. "Suffice it to say that in the ordinary sense of the word, there is no proof such as would be admitted in any scientific enquiry that there is any such thing as contagion."

3. A later RCT established that the conditions on which PROGRESA insisted – children going to school, preventive health care – were not necessary to achieve increased enrollment. Just the financial boost of the welfare payments was sufficient.

Good reads

The Strange Case of the Broad Street Pump: John Snow and the Mystery of Cholera by Sandra Hempel, published by our own University of California Press, reads like a whodunit. It was one of the main sources for this section's account of John Snow and his work. A word of warning: some of the contents of the book are stomach-churning.

Poor Economics, the bestseller by Abhijit V. Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to fight global poverty. It includes numerous examples of RCTs, including the PROGRESA example in this section.
Expressions

Programming can dramatically improve our ability to collect and analyze information about the world, which in turn can lead to discoveries through the kind of careful reasoning demonstrated in the previous section. In data science, the purpose of writing a program is to instruct a computer to carry out the steps of an analysis. Computers cannot study the world on their own. People must describe precisely what steps the computer should take in order to collect and analyze data, and those steps are expressed through programs.

To learn to program, we will begin by learning some of the grammar rules of a programming language. Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a * symbol between two numerical expressions. Expressions, such as 3 * 4, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 12 in this case, is displayed below the cell.
3 * 4
12
The rules of a programming language are rigid. In Python, the * symbol cannot appear twice in a row. The computer will not try to interpret an expression that differs from its prescribed expression structures. Instead, it will show a SyntaxError error. The Syntax of a language is its set of grammar rules, and a SyntaxError indicates that some rule has been violated.
3 * * 4

  File "<ipython-input-4-d90564f70db7>", line 1
    3 * * 4
        ^
SyntaxError: invalid syntax
Small changes to an expression can change its meaning entirely. Below, the space between the *'s has been removed. Because ** appears between two numerical expressions, the expression is a well-formed exponentiation expression (the first number raised to the power of the second: 3 times 3 times 3 times 3). The symbols * and ** are called operators, and the values they combine are called operands.
3 ** 4
81
Common Operators. Data science often involves combining numerical values, and the set of operators in a programming language are designed so that expressions can be used to express any sort of arithmetic. In Python, the following operators are essential.

Expression Type    Operator    Example     Value
Addition           +           2 + 3       5
Subtraction        -           2 - 3       -1
Multiplication     *           2 * 3       6
Division           /           7 / 3       2.33333
Remainder          %           7 % 3       1
Exponentiation     **          2 ** 0.5    1.41421
Python expressions obey the same familiar rules of precedence as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression.

1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8

16.555555555555557

1 + 2 * ((3 * 4 * 5 / 6) ** 3) + 7 + 8
2016.0
This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don't always need to ask an expert (or the Internet); many of these details can be discovered by trying them out yourself.
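In that spirit, here is a small exploration of our own (not from the text) that answers both questions above:

```python
# Dividing twice in a row applies each division in turn,
# because / groups from left to right.
print(7 / 3 / 2)   # same as (7 / 3) / 2

# Dividing by zero does not produce a value; Python raises an error.
try:
    7 / 0
except ZeroDivisionError as error:
    print('Error:', error)
```

Trying small experiments like these in a notebook cell is often faster than looking up the rule.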
Names

Names are given to values in Python using an assignment statement. In an assignment, a name is followed by =, which is followed by another expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.

a = 10
b = 20
a + b
30
A previously assigned name can be used in the expression to the right of =.

quarter = 2
whole = 4 * quarter
whole
8
However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically.

quarter = 4
whole
8
Names must start with a letter, but can contain both letters and numbers. A name cannot contain a space; instead, it is common to use an underscore character _ to replace each space. Names are only as useful as you make them; it's up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than a and b. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved.
purchase_price = 5
state_tax_rate = 0.075
county_tax_rate = 0.02
city_tax_rate = 0
sales_tax_rate = state_tax_rate + county_tax_rate + city_tax_rate
sales_tax = purchase_price * sales_tax_rate
sales_tax
0.475
Example: Growth Rates

The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012 (http://www.bls.gov/opub/mlr/2013/article/industry-employment-and-output-projections-to-2022-1.htm). To compute a growth rate, we must first decide which value to treat as the initial amount. For values over time, the earlier value is a natural choice. Then, we divide the difference between the changed and initial amount by the initial amount.

initial = 2766000
changed = 2814000
(changed - initial) / initial
0.01735357917570499
It is more typical to subtract one from the ratio of the two measurements, which yields the same value.

(changed / initial) - 1
0.017353579175704903
This value is the growth rate over 10 years. A useful property of growth rates is that they don't change even if the values are expressed in different units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012.

initial = 2766
changed = 2814
(changed / initial) - 1
0.017353579175704903
In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from $2.37 trillion to $3.38 trillion in 2012.

initial = 2.37
changed = 3.38
(changed / initial) - 1
0.4261603375527425
A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012.

initial = 287.6
changed = 314.1
(changed / initial) - 1
0.09214186369958277
A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate.

initial = 15.3
changed = 11.9
(changed / initial) - 1
-0.2222222222222222
An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%).
1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1
0.410598760621121
This same computation can be expressed using names and exponents for clarity.

annual_growth_rate = 0.035
ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1
ten_year_growth_rate
0.410598760621121
Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, t is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years.

initial = 2.37
changed = 3.38
t = 10
(changed / initial) ** (1/t) - 1
0.03613617208346853
The total growth over 10 years is equivalent to a 3.6% increase each year.
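As a check (ours, not the text's), compounding that annual rate for t years should recover the total ten-year ratio:

```python
initial = 2.37
changed = 3.38
t = 10

# The equivalent annual growth rate, as computed above.
annual_growth_rate = (changed / initial) ** (1 / t) - 1

# Compounding the annual rate for t years reproduces the total ratio.
total_ratio = (1 + annual_growth_rate) ** t
print(annual_growth_rate, total_ratio, changed / initial)
```

The last two printed values agree (up to tiny rounding error), confirming the equivalence.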
Call Expressions

Call expressions invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses.
abs(-12)
12
round(5 - 1.3)

4

max(2, 2 + 3, 4)

5
In this last example, the max function is called on three arguments: 2, 5, and 4. The value of each expression within parentheses is passed to the function, and the function returns the final value of the full call expression. The max function can take any number of arguments and returns the maximum.

A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator.

import math
import operator
math.sqrt(operator.add(4, 5))
3.0
An equivalent expression could be expressed using the + and ** operators instead.

(4 + 5) ** 0.5
3.0
Operators and call expressions can be used together in an expression. The percent difference between two values is used to compare values for which neither one is obviously initial or changed. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25 billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100.

florida = 2.72
iowa = 16.25
100 * abs(florida - iowa) / ((florida + iowa) / 2)
142.6462836056932
Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the names and effects of different functions. When editing a code cell, press the tab key after typing the beginning of a name to bring up a list of ways to complete that name. For example, press tab after math. to see all of the functions available in the math module. Typing will narrow down the list of options. To learn more about a function, place a ? after its name. For example, typing math.log? will bring up a description of the log function in the math module.

math.log?

log(x[, base])

Return the logarithm of x to the given base.
If the base not specified, returns the natural logarithm (base e) of x.

The square brackets in the example call indicate that an argument is optional. That is, log can be called with either one or two arguments.
math.log(16, 2)

4.0

math.log(16) / math.log(2)
4.0
Data Types

Python includes several types of values that will allow us to compute about more than just numbers.

Type Name                Python Class    Example 1        Example 2
Integer                  int             1                -2
Floating-Point Number    float           1.5              -2.0
Boolean                  bool            True             False
String                   str             'hello world'    "False"

The first two types represent numbers. A "floating-point" number describes its representation by a computer, which is a topic for another text. Practically, the difference between an integer and a floating-point number is that the latter can represent fractions in addition to whole values.

Boolean values, named for the logician George Boole, represent truth and take only two possible values: True and False.

Strings

Strings are the most flexible type of all. They can contain arbitrary text. A string can describe a number or a truth value, as well as any word or sentence. Much of the world's data is text, and text is represented in computers as strings.

The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value.
"data"+"science"
'datascience'
Addition is completely literal; it combines these two strings together without regard for their contents. It doesn't add a space because these are different words; that's up to the programmer (you) to specify.
"data"+"sc"+"ienc"+"e"
'datascience'
Single and double quotes can both be used to create strings: 'hi' and "hi" are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings.

"This won't work with a single-quoted string!"

"This won't work with a single-quoted string!"

Why not? Try it out.

The str function returns a string representation of any value. Using this function, strings can be constructed that have embedded values.

"That's " + str(1 + 1) + ' ' + str(True)

"That's 2 True"
String Methods

From an existing string, related strings can be constructed using string methods, which are functions that operate on strings. These methods are called by placing a dot after the string, then calling the function.

For example, the following method generates an uppercased version of a string.
"loud".upper()
'LOUD'
Perhaps the most important method is replace, which replaces all instances of a substring within the string. The replace method takes two arguments, the text to be replaced and its replacement.
'hitchhiker'.replace('hi', 'ma')
'matchmaker'
String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two-step process generates the word "degrade" starting from "train" by first creating "ingrain" and then applying a second replacement.

s = "train"
t = s.replace('t', 'ing')
u = t.replace('in', 'de')
u
'degrade'
Comparisons

Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than 1 + 1.

3 > 1 + 1
True
The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1+1. The full set of common comparison operators are listed below.

Comparison            Operator    True example    False example
Less than             <           2 < 3           2 < 2
Greater than          >           3 > 2           3 > 3
Less than or equal    <=          2 <= 2          3 <= 2
Greater or equal      >=          3 >= 3          2 >= 3
Equal                 ==          3 == 3          3 == 2
Not equal             !=          3 != 2          2 != 2

An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1+1 is between 1 and 3 using the following expression.

1 < 1 + 1 < 3
True
The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y below. You can try different values of x and y to confirm this relationship.
x = 12
y = 5
min(x, y) <= (x + y) / 2 <= max(x, y)
True
Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.

"Dog" > "Catastrophe" > "Cat"
True
Sequences

Values can be grouped together into collections, which allows programmers to organize those values and refer to all of them with a single name. Separating expressions by commas within square brackets places the values of those expressions into a list, which is a built-in sequential collection. After a comma, it's okay to start a new line. Below, we collect four different temperatures into a list called highs. These are the estimated average daily high temperatures over all land on Earth (in degrees Celsius) for the decades surrounding 1850, 1900, 1950, and 2000, respectively, expressed as deviations from the average absolute high temperature between 1951 and 1980, which was 14.48 degrees. (http://berkeleyearth.lbl.gov/regions/global-land)

baseline_high = 14.48
highs = [baseline_high - 0.880, baseline_high - 0.093,
         baseline_high + 0.105, baseline_high + 0.684]
highs

[13.6, 14.387, 14.585, 15.164]
Lists allow us to pass multiple values into a function using a single name. For instance, the sum function computes the sum of all values in a list, and the len function computes the length of the list. Using them together, we can compute the average of the list.

sum(highs) / len(highs)
14.434000000000001
Tuples. Values can also be grouped together into tuples in addition to lists. A tuple is expressed by separating elements using commas within parentheses: (1, 2, 3). In many cases, lists and tuples are interchangeable. We will use a tuple to hold the average daily low temperatures for the same four time ranges. The only change in the code is the switch to parentheses instead of brackets.
baseline_low = 3.00
lows = (baseline_low - 0.872, baseline_low - 0.629,
        baseline_low - 0.126, baseline_low + 0.728)
lows

(2.128, 2.371, 2.874, 3.7279999999999998)
The complete chart of daily high and low temperatures appears below.

[Charts: Mean of Daily High Temperature; Mean of Daily Low Temperature]
Arrays

Many experiments and data sets involve multiple values of the same type. An array is a collection of values that all have the same type. The numpy package, abbreviated np in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays.

import numpy as np

An array is created using the np.array function, which takes a list or tuple as an argument. Here, we create arrays of average daily high and low temperatures for the decades surrounding 1850, 1900, 1950, and 2000.
baseline_high = 14.48
highs = np.array([baseline_high - 0.880,
                  baseline_high - 0.093,
                  baseline_high + 0.105,
                  baseline_high + 0.684])
highs

array([ 13.6  ,  14.387,  14.585,  15.164])
baseline_low = 3.00
lows = np.array((baseline_low - 0.872, baseline_low - 0.629,
                 baseline_low - 0.126, baseline_low + 0.728))
lows

array([ 2.128,  2.371,  2.874,  3.728])
Arrays differ from lists and tuples because they can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion formula.
(9/5) * highs + 32

array([ 56.48  ,  57.8966,  58.253 ,  59.2952])
When two arrays are combined together using an arithmetic operator, their individual values are combined. For example, we can compute the daily temperature ranges by subtracting.

highs - lows

array([ 11.472,  12.016,  11.711,  11.436])
Arrays also have methods, which are functions that operate on the array values. The mean of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the example below is part of a call expression; it's calling a function with no arguments to perform a computation on the array called highs.

[highs.size, highs.sum(), highs.mean()]

[4, 57.736000000000004, 14.434000000000001]
It's common that small rounding errors appear in arithmetic on a computer. For instance, the mean (average) above is in fact 14.434, but an extra 0.0000000000000001 has been added during the computation of the average. Arithmetic rounding errors are an important topic in scientific computing, but won't be a focus of this text. Everyone who works with data must be aware that some operations will be approximated so that values can be stored and manipulated efficiently.
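A classic illustration of such approximation (our example, not the text's): 0.1 and 0.2 have no exact binary representation, so their sum is not exactly 0.3.

```python
# The stored sum differs from 0.3 by a tiny rounding error.
print(0.1 + 0.2)           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)    # False

# Comparisons of computed floats should allow a small tolerance.
print(abs((0.1 + 0.2) - 0.3) < 1e-9)   # True
```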
Functions on Arrays

In addition to basic arithmetic and methods such as min and max, manipulating arrays often involves calling functions that are part of the np package. For example, the diff function computes the difference between each adjacent pair of elements in an array. The first element of the diff is the second element minus the first.

np.diff(highs)
array([ 0.787,  0.198,  0.579])

The full NumPy reference lists these functions exhaustively, but only a small subset are used commonly for data processing applications. These are grouped into different packages within np. Learning this vocabulary is an important part of learning the Python language, so refer back to this list often as you work through examples and problems.

Each of these functions takes an array as an argument and returns a single value.

Function            Description
np.prod             Multiply all elements together
np.sum              Add all elements together
np.all              Test whether all elements are true values (non-zero numbers are true)
np.any              Test whether any elements are true values (non-zero numbers are true)
np.count_nonzero    Count the number of non-zero elements
Each of these functions takes an array as an argument and returns an array of values.

Function      Description
np.diff       Difference between adjacent elements
np.round      Round each number to the nearest integer (whole number)
np.cumprod    A cumulative product: for each element, multiply all elements so far
np.cumsum     A cumulative sum: for each element, add all elements so far
np.exp        Exponentiate each element
np.log        Take the natural logarithm of each element
np.sqrt       Take the square root of each element
np.sort       Sort the elements

Each of these functions takes an array of strings and returns an array.
Function             Description
np.char.lower        Lowercase each element
np.char.upper        Uppercase each element
np.char.strip        Remove spaces at the beginning or end of each element
np.char.isalpha      Whether each element is only letters (no numbers or symbols)
np.char.isnumeric    Whether each element is only numeric (no letters)

Each of these functions takes both an array of strings and a search string; each returns an array.

Function              Description
np.char.count         Count the number of times a search string appears among the elements of an array
np.char.find          The position within each element that a search string is found first
np.char.rfind         The position within each element that a search string is found last
np.char.startswith    Whether each element starts with the search string
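A few of the string functions above in action (our own examples, assuming numpy has been imported as np):

```python
import numpy as np

words = np.array(['hitchhiker', 'Guide', 'galaxy'])

# np.char.upper uppercases each element.
print(np.char.upper(words))            # ['HITCHHIKER' 'GUIDE' 'GALAXY']

# np.char.count counts occurrences of a search string in each element.
print(np.char.count(words, 'hi'))      # [2 0 0]

# np.char.startswith tests each element against a search string.
print(np.char.startswith(words, 'g'))
```

Note that the comparison is case-sensitive: 'Guide' does not start with a lowercase 'g'.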
Ranges

A range is an array of numbers in increasing or decreasing order, each separated by a regular interval. Ranges are defined using the np.arange function, which takes either one, two, or three arguments.

np.arange(end): An array starting with 0 of increasing integers up to end
np.arange(start, end): An array of increasing integers from start up to end
np.arange(start, end, step): A range with step between each pair of consecutive values

A range always includes its start value, but does not include its end value. The step can be either positive or negative and may be a whole number or a fraction.

by_four = np.arange(1, 100, 4)
by_four

array([ 1,  5,  9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, 65,
       69, 73, 77, 81, 85, 89, 93, 97])
Ranges have many uses. For instance, a range can be used to compute part of the Leibniz formula for π, which is typically written as

π = 4 × (1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + ⋯)
4 * sum(1/by_four - 1/(by_four + 2))
3.1215946525910097
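The sum above uses only 25 terms of the series. With a much longer range (our variation, not the text's), the same expression approaches π far more closely:

```python
import numpy as np

# Denominators 1, 5, 9, ... anchor the positive terms; each pair
# contributes 1/k - 1/(k + 2), covering the alternating series.
by_four = np.arange(1, 1000000, 4)
approximation = 4 * np.sum(1 / by_four - 1 / (by_four + 2))
print(approximation)   # close to 3.14159...
```

The error of the truncated series shrinks roughly in proportion to 1 over the largest denominator used.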
The Birthday Problem

There are k students in a class. What is the chance that at least two of the students have the same birthday?

Assumptions
1. No leap years; every year has 365 days
2. Births are distributed evenly throughout the year
3. No student's birthday is affected by any other (e.g. twins)

Let's start with an easy case: k is 4. We'll first find the chance that all four people have different birthdays.

all_different = (364/365) * (363/365) * (362/365)

The chance that there is at least one pair with the same birthday is equivalent to the chance that the birthdays are not all different. Given the chance of some event occurring, the chance that the event does not occur is one minus the chance that it does. With only 4 people, the chance that any two have the same birthday is less than 2%.

1 - all_different
0.016355912466550215
Using a range, we can express this same computation compactly. We begin with the numerators of each factor.

k = 4
numerators = np.arange(364, 365 - k, -1)
numerators

array([364, 363, 362])
Then, we divide each numerator by 365 to form the factors, multiply them all together, and subtract from 1.

1 - np.prod(numerators / 365)
0.016355912466550215
If k is 40, the chance of two birthdays being the same is much higher: almost 90%!
k = 40
numerators = np.arange(364, 365 - k, -1)
1 - np.prod(numerators / 365)
0.89123180981794903
Using ranges, we can investigate how this chance changes as k increases. The np.cumprod function computes the cumulative product of an array. That is, it computes a new array that has the same length as its input, but the ith element is the product of the first i terms. Below, the fourth term of the result is 1 * 2 * 3 * 4.

ten = np.arange(1, 9)
np.cumprod(ten)

array([    1,     2,     6,    24,   120,   720,  5040, 40320])
The following cell computes the chance of matching birthdays for every class size from 2 up to 365. Scrolling through the result, you will see that as k increases, the chance of matching birthdays reaches 1 long before the end of the array. In fact, for any k smaller than 365, there is a chance that all k students in a class can have different birthdays. The chance is so small, however, that the difference from 1 is rounded away by the computer.

numerators = np.arange(364, 0, -1)
chances = 1 - np.cumprod(numerators / 365)
chances
array([0.00273973,0.00820417,0.01635591,0.02713557,0.04046248,
0.0562357,0.07433529,0.09462383,0.11694818,0.14114138,
0.16702479,0.19441028,0.22310251,0.25290132,0.28360401,
0.31500767,0.34691142,0.37911853,0.41143838,0.44368834,
0.47569531,0.50729723,0.53834426,0.5686997,0.59824082,
0.62685928,0.65446147,0.68096854,0.70631624,0.73045463,
0.75334753,0.77497185,0.79531686,0.81438324,0.83218211,
0.84873401,0.86406782,0.87821966,0.89123181,0.90315161,
0.91403047,0.92392286,0.93288537,0.9409759,0.94825284,
0.9547744,0.96059797,0.96577961,0.97037358,0.97443199,
0.97800451,0.98113811,0.98387696,0.98626229,0.98833235,
0.99012246,0.99166498,0.99298945,0.99412266,0.9950888,
0.99590957,0.99660439,0.99719048,0.99768311,0.9980957,
0.99844004,0.99872639,0.99896367,0.99915958,0.99932075,
0.99945288,0.99956081,0.99964864,0.99971988,0.99977744,
0.99982378,0.99986095,0.99989067,0.99991433,0.99993311,
0.99994795,0.99995965,0.99996882,0.999976,0.99998159,
0.99998593,0.99998928,0.99999186,0.99999385,0.99999537,
0.99999652,0.9999974,0.99999806,0.99999856,0.99999893,
0.99999922,0.99999942,0.99999958,0.99999969,0.99999978,
0.99999984,0.99999988,0.99999992,0.99999994,0.99999996,
0.99999997,0.99999998,0.99999998,0.99999999,0.99999999,
0.99999999,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.,1.,
1.,1.,1.,1.])
The item method of an array allows us to select a particular element from the array by its position. The starting position of an array is item(0), so finding the chance of matching birthdays in a 40-person class involves extracting item(40 - 2) (because the starting item is for a 2-person class).

chances.item(40 - 2)
0.891231809817949
Tables

Tables are a fundamental object type for representing data sets. A table can be viewed in two ways. Tables are a sequence of named columns that each describe a single aspect of all entries in a data set. Tables are also a sequence of rows that each contain all information about a single entry in a data set.

In order to use tables, import all of the module called datascience, a module created for this text.

from datascience import *

Empty tables can be created using the Table function, which optionally takes a list of column labels. The with_row and with_rows methods of a table return new tables with one or more additional rows. For example, we could create a table of Odd and Even numbers.

Table(['Odd', 'Even']).with_row([3, 4])
Odd Even
3 4
Table(['Odd', 'Even']).with_rows([[3, 4], [5, 6], [7, 8]])
Odd Even
3 4
5 6
7 8
Tables can also be extended with additional columns. The with_column and with_columns methods return new tables with additional labeled columns. Below, we begin each example with an empty table that has no columns.

Table().with_column('Odd', [3, 5, 7])
Odd
3
5
7
Table().with_columns([
    'Odd', [3, 5, 7],
    'Even', [4, 6, 8]
])
Odd Even
3 4
5 6
7 8
Tables are more typically created from files that contain comma-separated values, called CSV files. The file below contains "Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States."

census_url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'
full_census_table = Table.read_table(census_url)
full_census_table
SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010
0 0 3944153 3944160 3951330
0 1 3978070 3978090 3957888
0 2 4096929 4096939 4090862
0 3 4119040 4119051 4111920
0 4 4063170 4063186 4077552
0 5 4056858 4056872 4064653
0 6 4066381 4066412 4073013
0 7 4030579 4030594 4043047
0 8 4046486 4046497 4025604
0 9 4148353 4148369 4125415
... (296 rows omitted)

A description of the table appears online. The SEX column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The AGE column contains ages, but the special value 999 is a sum of the total population. The rest of the columns contain estimates of the US population.

Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. We can select only a subset of the columns using the select method.

partial_census_table = full_census_table.select(['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014'])
partial_census_table
SEX AGE POPESTIMATE2010 POPESTIMATE2014
0 0 3951330 3948350
0 1 3957888 3962123
0 2 4090862 3957772
0 3 4111920 4005190
0 4 4077552 4003448
0 5 4064653 4004858
0 6 4073013 4134352
0 7 4043047 4154000
0 8 4025604 4119524
0 9 4125415 4106832
... (296 rows omitted)

The relabeled method creates an alternative version of the table that replaces a column label. Using this method, we can simplify the labels of the selected columns.

simple = partial_census_table.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2014', '2014')
simple
SEX AGE 2010 2014
0 0 3951330 3948350
0 1 3957888 3962123
0 2 4090862 3957772
0 3 4111920 4005190
0 4 4077552 4003448
0 5 4064653 4004858
0 6 4073013 4134352
0 7 4043047 4154000
0 8 4025604 4119524
0 9 4125415 4106832
... (296 rows omitted)

The column method returns an array of the values in a particular column. Each column of a table is an array of the same length, and so columns can be combined using arithmetic.

simple.column('2014') - simple.column('2010')

array([  -2980,    4235, -133090, ...,    6717,   13410, 4662996])

A table with an additional column can be created from an existing table using the with_column method, which takes a label and a sequence of values as arguments. There must be either as many values as there are existing rows in the table or a single value (which fills the column with that value).

change = simple.column('2014') - simple.column('2010')
annual_growth_rate = (simple.column('2014') / simple.column('2010')) ** (1/4) - 1
census = simple.with_columns(['Change', change, 'Growth', annual_growth_rate])
census
SEX AGE 2010 2014 Change Growth
0 0 3951330 3948350 -2980 -0.000188597
0 1 3957888 3962123 4235 0.000267397
0 2 4090862 3957772 -133090 -0.00823453
0 3 4111920 4005190 -106730 -0.0065532
0 4 4077552 4003448 -74104 -0.00457471
0 5 4064653 4004858 -59795 -0.00369821
0 6 4073013 4134352 61339 0.00374389
0 7 4043047 4154000 110953 0.00679123
0 8 4025604 4119524 93920 0.00578232
0 9 4125415 4106832 -18583 -0.00112804
... (296 rows omitted)

Although the columns of this table are simply arrays of numbers, the format of those numbers can be changed to improve the interpretability of the table. The set_format method takes Formatter objects, which exist for dates (DateFormatter), currencies (CurrencyFormatter), numbers, and percentages.

census.set_format('Growth', PercentFormatter)
census.set_format(['2010', '2014', 'Change'], NumberFormatter)
SEX AGE 2010 2014 Change Growth
0 0 3,951,330 3,948,350 -2,980 -0.02%
0 1 3,957,888 3,962,123 4,235 0.03%
0 2 4,090,862 3,957,772 -133,090 -0.82%
0 3 4,111,920 4,005,190 -106,730 -0.66%
0 4 4,077,552 4,003,448 -74,104 -0.46%
0 5 4,064,653 4,004,858 -59,795 -0.37%
0 6 4,073,013 4,134,352 61,339 0.37%
0 7 4,043,047 4,154,000 110,953 0.68%
0 8 4,025,604 4,119,524 93,920 0.58%
0 9 4,125,415 4,106,832 -18,583 -0.11%
...(296 rows omitted)
The information in a table can be accessed in many ways. The attributes demonstrated below can be used for any table.
census.labels
('SEX', 'AGE', '2010', '2014', 'Change', 'Growth')
census.num_columns
6
census.num_rows
306
census.row(5)
Row(SEX=0, AGE=5, 2010=4064653, 2014=4004858, Change=-59795, Growth=-0.0036982077956316806)
census.column(2)
array([3951330, 3957888, 4090862, ..., 26074, 45058, 157257573])
An individual item in a table can be accessed via a row or a column.
census.row(0).item(2)
3951330
census.column(2).item(0)
3951330
Let's take a look at the growth rates of the total number of males and females by selecting only the rows that sum over all ages. This sum is expressed with the special value 999 according to this dataset description.
census.where('AGE', 999)
SEX AGE 2010 2014 Change Growth
0 999 309,347,057 318,857,056 9,509,999 0.76%
1 999 152,089,484 156,936,487 4,847,003 0.79%
2 999 157,257,573 161,920,569 4,662,996 0.73%
What ages of males are driving this rapid growth? We can first filter the census table to keep only the male entries, then sort by growth rate in decreasing order.
males = census.where('SEX', 1)
males.sort('Growth', descending=True)
SEX AGE 2010 2014 Change Growth
1 99 6,104 9,037 2,933 10.31%
1 100 9,351 13,729 4,378 10.08%
1 98 9,504 13,649 4,145 9.47%
1 93 60,182 85,980 25,798 9.33%
1 96 22,022 31,235 9,213 9.13%
1 94 43,828 62,130 18,302 9.12%
1 97 14,775 20,479 5,704 8.50%
1 95 31,736 42,824 11,088 7.78%
1 91 104,291 138,080 33,789 7.27%
1 92 83,462 109,873 26,411 7.12%
...(92 rows omitted)
The fact that there are more men with AGE of 100 than 99 looks suspicious; shouldn't there be fewer? A careful look at the description of the dataset reveals that the 100 category actually includes all men who are 100 or older. The growth rates in men at these very old ages could have several explanations, such as a large influx from another country, but the most natural explanation is that people are simply living longer in 2014 than in 2010.
The where method can also take an array of boolean values, constructed by applying some comparison operator to a column of the table. For example, we can find all of the age groups among males for which the absolute Change is substantial. The show method displays all rows without abbreviating.
males.where(males.column('Change') > 200000).show()
SEX AGE 2010 2014 Change Growth
1 23 2,151,095 2,399,883 248,788 2.77%
1 24 2,161,380 2,391,398 230,018 2.56%
1 34 1,908,761 2,192,455 283,694 3.52%
1 57 1,910,028 2,110,149 200,121 2.52%
1 59 1,779,504 2,006,900 227,396 3.05%
1 64 1,291,843 1,661,474 369,631 6.49%
1 65 1,272,693 1,607,688 334,995 6.02%
1 66 1,239,805 1,589,127 349,322 6.40%
1 67 1,270,148 1,653,257 383,109 6.81%
1 71 903,263 1,169,356 266,093 6.67%
1 999 152,089,484 156,936,487 4,847,003 0.79%
By far the largest changes are clumped together in the 64-67 range. These people would have been born between 1947 and 1950, the height of the post-WWII baby boom in the United States.
It is possible to specify multiple conditions using the functions np.logical_and and np.logical_or. When two conditions are combined with logical_and, both must be true for a row to be retained. When conditions are combined with logical_or, at least one of them must be true for a row to be retained. Here are two different ways to select 18 and 19 year olds.
both = census.where(census.column('SEX') != 0)
both.where(np.logical_or(both.column('AGE') == 18,
                         both.column('AGE') == 19))
SEX AGE 2010 2014 Change Growth
1 18 2,305,733 2,165,062 -140,671 -1.56%
1 19 2,334,906 2,220,790 -114,116 -1.24%
2 18 2,185,272 2,060,528 -124,744 -1.46%
2 19 2,236,479 2,105,604 -130,875 -1.50%
both.where(np.logical_and(both.column('AGE') >= 18,
                          both.column('AGE') <= 19))
SEX AGE 2010 2014 Change Growth
1 18 2,305,733 2,165,062 -140,671 -1.56%
1 19 2,334,906 2,220,790 -114,116 -1.24%
2 18 2,185,272 2,060,528 -124,744 -1.46%
2 19 2,236,479 2,105,604 -130,875 -1.50%
Here, we observe a rather dramatic decrease in the number of 18 and 19 year olds in the United States; the children of the baby boom generation are now in their 20's, leaving fewer teenagers in America.
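The two formulations work because logical_and and logical_or combine boolean arrays element by element. A pure-Python sketch of that elementwise behavior, using a few made-up ages rather than the census rows:

```python
# Hypothetical ages; each comparison produces one boolean per element.
ages = [17, 18, 19, 20]
is_18 = [a == 18 for a in ages]
is_19 = [a == 19 for a in ages]

# Elementwise "or", as np.logical_or computes it: keep a row if either holds.
either = [p or q for p, q in zip(is_18, is_19)]

# Elementwise "and", as np.logical_and computes it, on two range conditions.
in_range = [a >= 18 and a <= 19 for a in ages]

print(either)
print(in_range)
```

Both lists mark the same positions, which is why the two selections above return identical rows.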
Aggregation and Grouping. This particular dataset includes entries for sums across all ages and sexes, using the special values 999 and 0 respectively. However, if these rows did not exist, we would be able to aggregate the remaining entries in order to compute those same results.
both.where('AGE', 999).select(['SEX', '2014'])
SEX 2014
1 156,936,487
2 161,920,569
no_sums = both.select(['SEX', 'AGE', '2014']).where(both.column('AGE') != 999)
females = no_sums.where('SEX', 2).select(['AGE', '2014'])
females
AGE 2014
0 1,930,493
1 1,938,870
2 1,935,270
3 1,956,572
4 1,959,950
5 1,961,391
6 2,024,024
7 2,031,760
8 2,014,402
9 2,009,560
...(91 rows omitted)
sum(females['2014'])
161920569
Some columns express categories, such as the sex or age group in the case of the census table. Aggregation can also be performed on every value for a category using the group and groups methods. The group method takes a single column (or column name) as an argument and collects all values associated with each unique value in that column.
The second argument to group, the name of a function, specifies how to aggregate the resulting values. For example, the sum function can be used to add together the populations for each age. Below, we first select only the AGE and 2014 columns; a SEX sum column would be meaningless because those values are simply codes for different categories. The 2014 sum column does in fact contain the total number across all sexes for each age category.
no_sums.select(['AGE', '2014']).group('AGE', sum)
AGE 2014 sum
0 3948350
1 3962123
2 3957772
3 4005190
4 4003448
5 4004858
6 4134352
7 4154000
8 4119524
9 4106832
...(91 rows omitted)
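Conceptually, group collects the values associated with each unique entry in the grouped column, then applies the aggregation function to each collection. A minimal dictionary-based sketch of that two-step process (the rows are made up, and this is not the datascience library's implementation):

```python
# Hypothetical (age, population) pairs, with ages repeated across sexes.
rows = [(0, 100), (0, 150), (1, 120), (1, 130), (0, 50)]

# Step 1: collect all populations associated with each unique age.
collected = {}
for age, pop in rows:
    collected.setdefault(age, []).append(pop)

# Step 2: apply the aggregation function (here sum) to each collection.
sums = {age: sum(pops) for age, pops in collected.items()}
print(sums)
```

Passing a different function in step 2 (such as max or len) would produce the other kinds of aggregation that group supports.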
Joining Tables. There are many situations in data analysis in which two different rows need to be considered together. Two tables can be joined into one, an operation that creates one long row out of two matching rows. These matching rows can be from the same table or different tables.
For example, we might want to estimate which age categories are expected to change significantly in the future. Someone who is 14 years old in 2014 will be 20 years old in 2020. Therefore, one estimate of the number of 20 year olds in 2020 is the number of 14 year olds in 2014. Between the ages of 1 and 50, annual mortality rates are very low (less than 0.5% for men and less than 0.3% for women [1]). Immigration also affects population changes, but for now we will ignore its influence. Let's consider just females in this analysis.
females['AGE+6'] = females['AGE'] + 6
females
AGE 2014 AGE+6
0 1,930,493 6
1 1,938,870 7
2 1,935,270 8
3 1,956,572 9
4 1,959,950 10
5 1,961,391 11
6 2,024,024 12
7 2,031,760 13
8 2,014,402 14
9 2,009,560 15
...(91 rows omitted)
In order to relate the age in 2014 to the age in 2020, we will join this table with itself, matching values in the AGE column with values in the AGE+6 column.
joined = females.join('AGE', females, 'AGE+6')
joined
AGE 2014 AGE+6 AGE_2 2014_2
6 2024024 12 0 1930493
7 2031760 13 1 1938870
8 2014402 14 2 1935270
9 2009560 15 3 1956572
10 2015380 16 4 1959950
11 2001949 17 5 1961391
12 1993547 18 6 2024024
13 2041159 19 7 2031760
14 2068252 20 8 2014402
15 2035299 21 9 2009560
...(85 rows omitted)
The resulting table has five columns. The first three are the same as before. The two new columns are the values for AGE and 2014 that appeared in a different row, the one in which that AGE appeared in the AGE+6 column. For instance, the first row contains the number of 6 year olds in 2014 and an estimate of the number of 6 year olds in 2020 (who were 0 in 2014). Some relabeling and selecting makes this table more clear.
future = joined.select(['AGE', '2014', '2014_2']).relabeled('2014_2', '2020 (est)')
future
AGE 2014 2020 (est)
6 2024024 1930493
7 2031760 1938870
8 2014402 1935270
9 2009560 1956572
10 2015380 1959950
11 2001949 1961391
12 1993547 2024024
13 2041159 2031760
14 2068252 2014402
15 2035299 2009560
...(85 rows omitted)
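The matching that join performs can be sketched in plain Python: for each row of one table, find the rows of the other whose join-column values are equal, and concatenate each matching pair into one long row. A simplified sketch using a few of the population figures from above (it keeps fewer columns than the real joined table):

```python
# Left rows: (AGE, 2014 population). Right rows: (AGE+6, 2014 population).
left = [(6, 2024024), (7, 2031760)]
right = [(6, 1930493), (7, 1938870), (12, 1993547)]

# Keep only pairs whose join-column values match, and concatenate the rest
# of each matching right row onto the left row.
joined_rows = [l + r[1:] for l in left for r in right if l[0] == r[0]]
print(joined_rows)
```

The unmatched right row (key 12) is dropped, just as join discards values that do not appear in both columns.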
Adding a Change column and sorting by that change describes some of the major changes in age categories that we can expect in the United States for people under 50. According to this simplistic analysis, there will be substantially more people in their late 30's by 2020.
predictions = future.where(future['AGE'] < 50)
predictions['Change'] = predictions['2020 (est)'] - predictions['2014']
predictions.sort('Change', descending=True)
AGE 2014 2020 (est) Change
40 1940627 2170440 229813
38 1936610 2154232 217622
30 2110672 2301237 190565
37 1995155 2148981 153826
39 1993913 2135416 141503
29 2169563 2298701 129138
35 2046713 2169563 122850
36 2009423 2110672 101249
28 2144666 2244480 99814
41 1977497 2046713 69216
...(34 rows omitted)
Functions
We are building up a useful inventory of techniques for identifying patterns and themes in a dataset. Sorting and grouping rows of a table can focus our attention. The next approach to analysis we will consider involves grouping rows of a table by arbitrary criteria. To do so, we will explore two core features of the Python programming language: function definition and conditional statements.
We have used functions extensively already in this text, but never defined a function of our own. The purpose of defining a function is to give a name to a computational process that may be applied multiple times. There are many situations in computing that require repeated computation. For example, it is often the case that we will want to perform the same manipulation on every value in a column.
A function is defined in Python using a def statement, which is a multi-line statement that begins with a header line giving the name of the function and names for the arguments of the function. The rest of the def statement, called the body, must be indented below the header.
A function expresses a relationship between its inputs (called arguments) and its outputs (called return values). The number of arguments required to call a function is the number of names that appear within parentheses in the def statement header. The values that are returned depend on the body. Whenever a function is called, its body is executed. Whenever a return statement within the body is executed, the call to the function completes and the value of the return expression is returned as the value of the function call.
The definition of the percent function below multiplies a number by 100 and rounds the result to two decimal places.
def percent(x):
    return round(100 * x, 2)
The primary difference between defining a percent function and simply evaluating its return expression, round(100 * x, 2), is that when a function is defined, its return expression is not immediately evaluated. It cannot be, because the value for x is not yet defined. Instead, the return expression is evaluated whenever this percent function is called by placing parentheses after the name percent and placing an expression to compute its argument in parentheses.
percent(1/6)
16.67
percent(1/6000)
0.02
percent(1/60000)
0.0
The three expressions above are all call expressions. In the first, the value of 1/6 is computed and then passed as the argument named x to the percent function. When the percent function is called in this way, its body is executed. The body of percent has only a single line: return round(100 * x, 2). Executing this return statement completes execution of the percent function's body and computes the value of the call expression.
The same result is computed by passing a named value as an argument. The percent function does not know or care how its argument is computed or stored; its only job is to execute its own body using the arguments passed to it.
sixth = 1/6
percent(sixth)
16.67
Conditional Statements. The body of a function can have more than one line and more than one return statement. A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression. While conditional statements can appear anywhere, they appear most often within the body of a function in order to express alternative behavior depending on argument values.
A conditional statement always begins with an if header, which is a single line followed by an indented body. The body is only executed if the expression directly following if (called the if expression) evaluates to a true value. If the if expression evaluates to a false value, then the body of the if is skipped.
For example, we can improve our percent function so that it doesn't round very small numbers to zero so readily. The behavior of percent(1/6) is unchanged, but percent(1/60000) provides a more useful result.
def percent(x):
    if x < 0.00005:
        return 100 * x
    return round(100 * x, 2)
percent(1/6)
16.67
percent(1/6000)
0.02
percent(1/60000)
0.0016666666666666668
A conditional statement can also have multiple clauses with multiple bodies, and only one of those bodies can ever be executed. The general format of a multi-clause conditional statement appears below.
if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>
There is always exactly one if clause, but there can be any number of elif clauses. Python will evaluate the if and elif expressions in the headers in order until one is found that is a true value, then execute the corresponding body. The else clause is optional. When an else header is provided, its else body is executed only if none of the header expressions of the previous clauses are true. The else clause must always come at the end (or not at all).
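Because the headers are evaluated in order, an argument that satisfies several conditions still triggers only the first true clause. A small illustrative example (classify is not a function from the text):

```python
def classify(x):
    if x < 10:
        return 'small'      # chosen for any x below 10, even though x < 100 also holds
    elif x < 100:
        return 'medium'
    else:
        return 'large'

# 3 satisfies both x < 10 and x < 100, but only the first true clause runs.
print(classify(3), classify(50), classify(1000))
```

This ordering is why the thresholds in a multi-clause conditional are usually written from most restrictive to least restrictive.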
Let us continue to refine our percent function. Perhaps for some analysis, any value below 1e-8 should be considered close enough to 0 that it can be ignored. The following function definition handles this case as well.
def percent(x):
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)
percent(1/6)
16.67
percent(1/6000)
0.02
percent(1/60000)
0.0016666666666666668
percent(1/60000000000)
0.0
A well-composed function has a name that evokes its behavior, as well as a docstring: a description of its behavior and expectations about its arguments. The docstring can also show example calls to the function, where the call is preceded by >>>.
A docstring can be any string that immediately follows the header line of a def statement. Docstrings are typically defined using triple quotation marks at the start and end, which allows the string to span multiple lines. The first line is conventionally a complete but short description of the function, while following lines provide further guidance to future users of the function.
A more complete definition of percent that includes a docstring appears below.
def percent(x):
    """Convert x to a percentage by multiplying by 100.

    Percentages are conventionally rounded to two decimal places,
    but precision is retained for any x above 1e-8 that would
    otherwise round to 0.

    >>> percent(1/6)
    16.67
    >>> percent(1/6000)
    0.02
    >>> percent(1/60000)
    0.0016666666666666668
    >>> percent(1/60000000000)
    0.0
    """
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)
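One payoff of the >>> convention is that Python's standard doctest module can execute the examples embedded in a docstring and compare the printed results against the expected output. A minimal sketch, using an abbreviated version of the docstring above:

```python
import doctest

def percent(x):
    """Convert x to a percentage by multiplying by 100.

    >>> percent(1/6)
    16.67
    """
    return round(100 * x, 2)

# Find the examples in percent's docstring and run them.
runner = doctest.DocTestRunner()
for test in doctest.DocTestFinder().find(percent):
    runner.run(test)
print(runner.failures)
```

runner.failures is 0 when every example's actual output matches the output written in the docstring, so the documentation doubles as a small test suite.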
Functions and Tables
In this section, we will continue to use the US Census dataset. We will focus only on the 2014 population estimate.
census_2014 = census.select(['SEX', 'AGE', '2014']).where(census.column('AGE') != 999)
census_2014
SEX AGE 2014
0 0 3,948,350
0 1 3,962,123
0 2 3,957,772
0 3 4,005,190
0 4 4,003,448
0 5 4,004,858
0 6 4,134,352
0 7 4,154,000
0 8 4,119,524
0 9 4,106,832
...(293 rows omitted)
Functions can be used to compute new columns in a table based on existing column values. For example, we can transform the codes in the SEX column to strings that are easier to interpret.
def male_female(code):
    if code == 0:
        return 'Total'
    elif code == 1:
        return 'Male'
    elif code == 2:
        return 'Female'
This function takes an individual code (0, 1, or 2) and returns a string that describes its meaning.
male_female(0)
'Total'
male_female(1)
'Male'
male_female(2)
'Female'
We could also transform ages into age categories.
def age_group(age):
    if age < 2:
        return 'Baby'
    elif age < 13:
        return 'Child'
    elif age < 20:
        return 'Teen'
    else:
        return 'Adult'
age_group(15)
'Teen'
Apply. The apply method of a table calls a function on each element of a column, forming a new array of return values. To indicate which function to call, just name it (without quotation marks). The name of the column of input values must still appear within quotation marks.
census_2014.apply(male_female, 'SEX')
array(['Total', 'Total', 'Total', ..., 'Female', 'Female', 'Female'],
      dtype='<U6')
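Conceptually, apply is a loop: it calls the function once per element of the column and collects the results in order. A pure-Python sketch of that behavior, applying the male_female function to a few illustrative codes rather than the full column:

```python
def male_female(code):
    if code == 0:
        return 'Total'
    elif code == 1:
        return 'Male'
    elif code == 2:
        return 'Female'

# What apply does, conceptually: one call per element, results kept in order.
sex_codes = [0, 0, 1, 2]
labels = [male_female(code) for code in sex_codes]
print(labels)
```

The real apply method returns the collected results as an array rather than a list, but the element-by-element calling pattern is the same.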
This array, which has the same length as the original SEX column of the population table, can be used as the values in a new column called Male/Female alongside Age Group and Population columns.
population = Table().with_columns([
    'Male/Female', census_2014.apply(male_female, 'SEX'),
    'Age Group', census_2014.apply(age_group, 'AGE'),
    'Population', census_2014.column('2014')
])
population
Male/Female Age Group Population
Total Baby 3948350
Total Baby 3962123
Total Child 3957772
Total Child 4005190
Total Child 4003448
Total Child 4004858
Total Child 4134352
Total Child 4154000
Total Child 4119524
Total Child 4106832
...(293 rows omitted)
Groups. The group method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.
population.group('Age Group')
Age Group count
Adult 243
Baby 6
Child 33
Teen 21
The optional second argument names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the populations in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.
totals = population.where(0, 'Total').select(['Age Group', 'Population'])
totals
Age Group Population
Baby 3948350
Baby 3962123
Child 3957772
Child 4005190
Child 4003448
Child 4004858
Child 4134352
Child 4154000
Child 4119524
Child 4106832
...(91 rows omitted)
totals.group('Age Group', sum)
Age Group Population sum
Adult 236721454
Baby 7910473
Child 44755656
Teen 29469473
The groups method behaves in the same way, but accepts a list of columns as its first argument. The resulting table has one row for every unique combination of values that appear together in the grouped columns. Again, a single argument (a list, in this case) gives row counts.
population.groups(['Male/Female', 'Age Group'])
Male/Female Age Group count
Female Adult 81
Female Baby 2
Female Child 11
Female Teen 7
Male Adult 81
Male Baby 2
Male Child 11
Male Teen 7
Total Adult 81
Total Baby 2
...(2 rows omitted)
A second argument to groups aggregates all other columns that do not appear in the list of grouped columns.
population.groups(['Male/Female', 'Age Group'], sum)
Male/Female Age Group Population sum
Female Adult 121754366
Female Baby 3869363
Female Child 21903805
Female Teen 14393035
Male Adult 114967088
Male Baby 4041110
Male Child 22851851
Male Teen 15076438
Total Adult 236721454
Total Baby 7910473
...(2 rows omitted)
Pivot. The pivot method is closely related to the groups method: it groups together rows that share a combination of values. It differs from groups because it organizes the resulting values in a grid. The first argument to pivot is a column that contains the values that will be used to form new columns in the result. The second argument is a column used for grouping rows. The result gives the count of all rows that share the combination of column and row values.
population.pivot('Male/Female', 'Age Group')
Age Group Female Male Total
Adult 81 81 81
Baby 2 2 2
Child 11 11 11
Teen 7 7 7
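A pivot lays group counts out in a grid: one row per value of the grouping column and one column per value of the pivoting column. A minimal dictionary-based sketch with made-up rows (not the datascience library's implementation):

```python
# Hypothetical (Male/Female, Age Group) pairs for a few rows.
rows = [('Male', 'Baby'), ('Female', 'Baby'), ('Male', 'Teen'),
        ('Male', 'Baby'), ('Female', 'Teen')]

# Count rows for each (row value, column value) combination...
counts = {}
for col_value, row_value in rows:
    key = (row_value, col_value)
    counts[key] = counts.get(key, 0) + 1

# ...then lay the counts out as a grid keyed by the grouping column.
grid = {age: {sex: counts.get((age, sex), 0) for sex in ['Female', 'Male']}
        for age in ['Baby', 'Teen']}
print(grid)
```

Combinations that never occur get a count of 0, which is how pivot fills empty cells of the grid.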
An optional third argument indicates a column of values that will replace the counts in each cell of the grid. The fourth argument indicates how to aggregate all of the values that match the combination of column and row values.
pivoted = population.pivot('Male/Female', 'Age Group', 'Population', sum)
pivoted
Age Group Female Male Total
Adult 121754366 114967088 236721454
Baby 3869363 4041110 7910473
Child 21903805 22851851 44755656
Teen 14393035 15076438 29469473
The advantage of pivot is that it places grouped values into adjacent columns, so that they can be combined. For instance, this pivoted table allows us to compute the proportion of each age group that is male. We find the surprising result that younger age groups are predominantly male, but among adults there are substantially more females.
pivoted.with_column('Male Percentage', pivoted.column('Male') / pivoted.column('Total')).set_format('Male Percentage', PercentFormatter)
Age Group Female Male Total Male Percentage
Adult 121754366 114967088 236721454 48.57%
Baby 3869363 4041110 7910473 51.09%
Child 21903805 22851851 44755656 51.06%
Teen 14393035 15076438 29469473 51.16%
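The new column is just an elementwise division of the Male column by the Total column. Checking the Adult row by hand with the figures above:

```python
# Adult row of the pivoted table: male and total 2014 populations.
male_adults = 114967088
total_adults = 236721454

male_share = male_adults / total_adults
print(round(100 * male_share, 2))
```

Rounded to two places, this reproduces the 48.57% shown for adults.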
Chapter 2
Visualizations
While we have been able to make several interesting observations about data by simply running our eyes down the numbers in a table, our task becomes much harder as tables grow larger. In data science, a picture can be worth a thousand numbers.
We saw an example of this earlier in the text, when we examined John Snow's map of cholera deaths in London in 1854.
Snow showed each death as a black mark at the location where the death occurred. In doing so, he plotted three variables (the number of deaths, as well as two coordinates for each location) in a single graph without any ornaments or flourishes. The simplicity of his presentation focuses the viewer's attention on his main point, which was that the deaths were centered around the Broad Street pump.
In 1869, a French civil engineer named Charles Joseph Minard created what is still considered one of the greatest graphs of all time. It shows the decimation of Napoleon's army during its retreat from Moscow. In 1812, Napoleon had set out to conquer Russia, with over 400,000 men in his army. They did reach Moscow, but were plagued by losses along the way; the Russian army kept retreating farther and farther into Russia, deliberately burning fields and destroying villages as it retreated. This left the French army without food or shelter as the brutal Russian winter began to set in. The French army turned back without a decisive victory in Moscow. The weather got colder, and more men died. Only 10,000 returned.
The graph is drawn over a map of eastern Europe. It starts at the Polish-Russian border at the left end. The light brown band represents Napoleon's army marching towards Moscow, and the black band represents the army returning. At each point of the graph, the width of the band is proportional to the number of soldiers in the army. At the bottom of the graph, Minard includes the temperatures on the return journey.
Notice how narrow the black band becomes as the army heads back. The crossing of the Berezina river was particularly devastating; can you spot it on the graph?
The graph is remarkable for its simplicity and power. In a single graph, Minard shows six variables:
the number of soldiers
the direction of the march
the two coordinates of location
the temperature on the return journey
the location on specific dates in November and December
Edward Tufte, Professor at Yale and one of the world's experts on visualizing quantitative information, says that Minard's graph is "probably the best statistical graphic ever drawn."
The technology of our times allows us to include animation and color. Used judiciously, without excess, these can be extremely informative, as in this animation by Gapminder.org of annual carbon dioxide emissions in the world over the past two centuries.
A caption says, "Click Play to see how USA becomes the largest emitter of CO2 from 1900 onwards." The graph of total emissions shows that China has higher emissions than the United States. However, in the graph of per capita emissions, the US is higher than China, because China's population is much larger than that of the United States.
Technology can be a help as well as a hindrance. Sometimes, the ability to create a fancy picture leads to a lack of clarity in what is being displayed. Inaccurate representation of numerical information, in particular, can lead to misleading messages.
Here, from the Washington Post in the early 1980s, is a graph that attempts to compare the earnings of doctors with the earnings of other professionals over a few decades. Do we really need to see two heads (one with a stethoscope) on each bar? Tufte coined the term "chartjunk" for such unnecessary embellishments. He also deplores the "low data-to-ink ratio" which this graph unfortunately possesses.
Most importantly, the horizontal axis of the graph is not drawn to scale. This has a significant effect on the shape of the bar graphs. When drawn to scale and shorn of decoration, the graphs reveal trends that are quite different from the apparently linear growth in the original. The elegant graph below is due to Ross Ihaka, one of the originators of the statistical system R.
Bar Charts
Categorical Data and Bar Charts
Now that we have examined several graphics produced by others, it is time to produce some of our own. We will start with bar charts, a type of graph with which you might already be familiar. A bar chart shows the distribution of a categorical variable, that is, a variable whose values are categories. In human populations, examples of categorical variables include gender, ethnicity, marital status, country of citizenship, and so on.
A bar chart consists of a sequence of rectangular bars, one corresponding to each category. The length of each bar is proportional to the number of entries in the corresponding category.
The Kaiser Family Foundation has compiled Census data on the distribution of race and ethnicity in the United States.
http://kff.org/other/state-indicator/distribution-by-raceethnicity/
http://kff.org/other/state-indicator/children-by-raceethnicity/
Here are some of the data, arranged by state. The table children contains data for people who were younger than 18 years old in 2014; everyone contains the data for people of all ages in 2014. Some missing values in these tables are represented as 0.
everyone = Table.read_table('kaiser_ethnicity_everyone.csv')
children = Table.read_table('kaiser_ethnicity_children.csv')
children.set_format(2, NumberFormatter)
everyone.set_format(2, NumberFormatter)
State Race/Ethnicity Population
Alabama White 3,167,600
Alabama Black 1,269,200
Alabama Hispanic 191,000
Alabama Asian 77,300
Alabama American Indian/Alaska Native 0
Alabama Two Or More Races 56,100
Alabama Total 4,768,000
Alaska White 396,400
Alaska Black 17,000
Alaska Hispanic 59,200
...(347 rows omitted)
The method barh takes two arguments, a column containing categories and a column containing numeric quantities. It generates a horizontal bar chart. This chart focuses on California.
everyone.where('State', 'California').barh('Race/Ethnicity', 'Population')
The table of children also contains information about California, although there are fewer categories of race/ethnicity.
children.where('State', 'California')
State Race/Ethnicity Population
California White 2,786,000
California Black 484,200
California Hispanic 4,849,400
California Other 1,590,100
California Total 9,709,700
Again, we can display these data in a horizontal bar chart. The barh method can take column indices as well as labels. This chart looks similar, but involves fewer categories because the Kaiser Family Foundation only provides these data about children.
children.where('State', 'California').barh(1, 2)
It's not obvious how to compare these two charts. To help us draw conclusions, we will combine these two tables into one, which contains both everyone and children as separate columns. This transformation will require several steps. First, to ensure that values are comparable, we will divide each population by the total for the state to compute state proportions.
totals = everyone.join('State', everyone.where(1, 'Total').select([0, 2]).relabeled(1, 'State Total'))
totals
State Race/Ethnicity Population State Total
Alabama White 3167600 4768000
Alabama Black 1269200 4768000
Alabama Hispanic 191000 4768000
Alabama Asian 77300 4768000
Alabama American Indian/Alaska Native 0 4768000
Alabama Two Or More Races 56100 4768000
Alabama Total 4768000 4768000
Alaska White 396400 695700
Alaska Black 17000 695700
Alaska Hispanic 59200 695700
...(347 rows omitted)
proportions = everyone.select([0, 1]).with_column('Proportion', totals.column(2) / totals.column(3))
proportions.set_format(2, PercentFormatter)
State Race/Ethnicity Proportion
Alabama White 66.43%
Alabama Black 26.62%
Alabama Hispanic 4.01%
Alabama Asian 1.62%
Alabama American Indian/Alaska Native 0.00%
Alabama Two Or More Races 1.18%
Alabama Total 100.00%
Alaska White 56.98%
Alaska Black 2.44%
Alaska Hispanic 8.51%
...(347 rows omitted)
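The proportion arithmetic can be checked by hand for a single row. For Alabama's White population, dividing by the state total reproduces the 66.43% shown above:

```python
# Alabama figures from the everyone table: one group and the state total.
white = 3167600
state_total = 4768000

proportion = white / state_total
print(round(100 * proportion, 2))
```

Every Proportion entry in the table is this same division, performed elementwise on the two population columns.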
This same transformation can be applied to the children table. Below, we chain together several table methods in order to express the transformation in a single long line. Typically, it's best to express transformations in pieces (above), but programming languages are quite flexible in allowing programmers to combine expressions (below).
children_proportions = children.select([0, 1]).with_column(
    'Children Proportion', children.column(2) / children.join('State',
        children.where(1, 'Total').select([0, 2]).relabeled(1, 'State Total')).column(3))
children_proportions.set_format(2, PercentFormatter)
State Race/Ethnicity Children Proportion
Alabama White 59.99%
Alabama Black 30.50%
Alabama Hispanic 6.41%
Alabama Other 3.09%
Alabama Total 100.00%
Alaska White 46.00%
Alaska Black 2.60%
Alaska Hispanic 11.99%
Alaska Other 39.36%
Alaska Total 100.00%
...(245 rows omitted)
The following function takes the name of a state. It joins together the proportions for everyone and children into a single table. When the join method combines two columns with different values, only the matching elements are retained. In this case, any race/ethnicity not represented in both tables will be excluded.
def compare(state):
    prop_for_state = proportions.where('State', state).drop(0)
    children_prop_for_state = children_proportions.where('State', state).drop(0)
    joined = prop_for_state.join('Race/Ethnicity', children_prop_for_state)
    joined.set_format([1, 2], PercentFormatter)
    return joined.sort('Proportion')
compare('California')
Race/Ethnicity Proportion Children Proportion
Black 5.34% 4.99%
Hispanic 38.21% 49.94%
White 39.20% 28.69%
Total 100.00% 100.00%
The barh method can display the data from multiple columns on a single chart. When called with only one argument indicating the column of categories, all other columns are displayed as sets of bars.
compare('California').barh('Race/Ethnicity')
This chart shows the changing composition of the state of California. In 2014, fewer than 40% of Californians were Hispanic. But 50% of the children were, which is the crucial ingredient in predictions that Hispanics will eventually be a majority in California.
What about other states? Our function allows quick visual comparisons for any state.
compare('Minnesota').barh('Race/Ethnicity')
compare('Vermont').barh('Race/Ethnicity')
Histograms
Quantitative Data and Histograms
Many of the variables that data scientists study are quantitative. For instance, we can study the amount of revenue earned by movies in recent decades. Our source is the Internet Movie Database, an online database that contains information about movies, television shows, video games, and so on.
The table top consists of the U.S.A.'s top grossing movies of all time (http://www.boxofficemojo.com/alltime/adjusted.htm). The first column contains the title of the movie; Star Wars: The Force Awakens has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio; the third contains the U.S. box office gross in dollars; the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices; and the fifth contains the release year of the movie.
There are 200 movies on the list. Here are the top ten.
top = Table.read_table('top_movies.csv')
top.set_format([2, 3], NumberFormatter)
Title Studio Gross Gross (Adjusted) Year
Star Wars: The Force Awakens Buena Vista (Disney) 906,723,418 906,723,400 2015
Avatar Fox 760,507,625 846,120,800 2009
Titanic Paramount 658,672,302 1,178,627,900 1997
Jurassic World Universal 652,270,625 687,728,000 2015
Marvel's The Avengers Buena Vista (Disney) 623,357,910 668,866,600 2012
The Dark Knight Warner Bros. 534,858,444 647,761,600 2008
Star Wars: Episode I - The Phantom Menace Fox 474,544,677 785,715,000 1999
Star Wars Fox 460,998,007 1,549,640,500 1977
Avengers: Age of Ultron Buena Vista (Disney) 459,005,868 465,684,200 2015
The Dark Knight Rises Warner Bros. 448,139,099 500,961,700 2012
...(190 rows omitted)
Three-digit numbers are easier to work with than nine-digit numbers. So, we will add an additional millions column containing the adjusted gross box office revenue in millions.
millions = top.select(0).with_column('Gross', np.round(top.column(3) / 1e6, 2))
millions
Title Gross
Star Wars: The Force Awakens 906.72
Avatar 846.12
Titanic 1178.63
Jurassic World 687.73
Marvel's The Avengers 668.87
The Dark Knight 647.76
Star Wars: Episode I - The Phantom Menace 785.72
Star Wars 1549.64
Avengers: Age of Ultron 465.68
The Dark Knight Rises 500.96
...(190 rows omitted)
The hist method generates a histogram of the values in a column. The optional unit argument is used to label the axes.
millions.hist('Gross', unit='Million Dollars')
Thefigureaboveshowsthedistributionoftheamountsgrossed,inmillionsofdollars.Theamountshavebeengroupedintocontiguousintervalscalledbins.Althoughinthisdatasetnomoviegrossedanamountthatisexactlyontheedgebetweentwobins,itisworthnotingthathisthasanendpointconvention:binsincludethedataattheirleftendpoint,butnotthedataattheirrightendpoint.Sometimes,adjustmentshavetobemadeinthefirstorlastbin,
ComputationalandInferentialThinking
98Histograms
toensurethatthesmallestandlargestvaluesofthevariableareincluded.YousawanexampleofsuchanadjustmentintheCensusdatausedintheTablessection,whereanageof"100"yearsactuallymeant"100yearsoldorolder."
We can see that there are 10 bins (some bars are so low that they are hard to see), and that they all have the same width. We can also see that the list contains no movie that grossed less than 300 million dollars; that is because we are considering only the top grossing movies of all time. It is a little harder to see exactly where the edges of the bins are placed. For example, it is not clear exactly where the value 500 lies on the horizontal axis, and so it is hard to judge exactly where the first bar ends and the second begins.
The optional argument bins can be used with hist to specify the edges of the bars. It must consist of a sequence of numbers that includes the left end of the first bar and the right end of the last bar. As the highest gross amount is somewhat over 1,500 on the horizontal scale, we will start by setting bins to be the array consisting of the numbers 300, 400, 500, and so on, ending with 2000.

millions.hist('Gross', unit="Million Dollars", bins=np.arange(300, 2001, 100))
This figure is easier to read. On the horizontal axis, the labels 200, 400, 600, and so on are centered at the corresponding values. The tallest bar is for movies that grossed between 300 million and 400 million (adjusted) dollars; the bar for 400 to 500 million is about 5/8 as high. The height of each bar is proportional to the number of movies that fall within the x-axis range of the bar.

A very small number grossed more than 700 million dollars. This results in the figure being "skewed to the right," or, less formally, having "a long right hand tail." Distributions of variables like income or rent often have this kind of shape.
The counts of values in different ranges can be computed from a table using the bin method, which takes a column label or index and an optional sequence or number of bins. The result is a tabular form of a histogram; it lists the counts of all values in the 'Gross' column that are greater than or equal to the indicated bin, but less than the next bin's starting value.

bins = millions.bin('Gross', bins=np.arange(300, 2001, 100))
bins
bin    Gross count
300 81
400 52
500 28
600 16
700 7
800 5
900 3
1000 1
1100 3
1200 2
... (8 rows omitted)
The vertical axis of a histogram is the density scale. The height of each bar is the proportion of elements that fall into the corresponding bin, divided by the width of the bin. For example, the height of the bin from 300 to 400, which contains 81 of the 200 movies, is

(81/200) / 100 = 0.405% per million dollars.

starts = bins.column(0)
counts = bins.column(1)
counts.item(0)/sum(counts)/(starts.item(1) - starts.item(0))
0.0040500000000000006
The total area of this bar is 0.405, 100 times greater because its width is 100. This area is the proportion of all values in the Gross column that fall into the 300-400 bin. The area of all bars in a histogram sums to 1.
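To make the density arithmetic concrete, here is a sketch in plain numpy with a small hypothetical dataset (not the movie data): it converts bin counts into density-scale heights and checks that the bar areas sum to 1, even when the bins have unequal widths.

```python
import numpy as np

# Hypothetical data: 10 values binned into unequal-width bins.
data = np.array([1, 2, 2, 3, 5, 6, 7, 11, 13, 19])
edges = np.array([0, 4, 8, 20])             # bins [0, 4), [4, 8), [8, 20]

counts, _ = np.histogram(data, bins=edges)  # counts per bin: [4, 3, 3]
widths = np.diff(edges)                     # bin widths: [4, 4, 12]
heights = (counts / len(data)) / widths     # density: proportion per unit width

areas = heights * widths                    # proportion of data in each bin
print(areas.sum())                          # close to 1.0
```

The heights differ across bins, but multiplying each height by its width always recovers the proportions, which total 1.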
A histogram may contain bins of unequal size, but the area of all bars will still sum to 1. Below, the values in the Gross column are binned into four categories.

millions.hist('Gross', unit="Million Dollars", bins=[300, 400, 600, 1500])
Although the ranges 300-400 and 400-600 have nearly identical counts, the former is twice as tall as the latter because it is only half as wide. The density of values in that range is twice as high. Histograms display in visual form where on the number line values are most concentrated.

millions.bin('Gross', bins=[300, 400, 600, 1500])
bin    Gross count
300 81
400 80
600 37
1500 0
Histogram of Counts. It is possible to display counts directly in a chart, using the normed=False option of the hist method. The resulting chart has the same shape when bins all have equal widths, but the bar areas do not sum to 1.
millions.hist('Gross', bins=np.arange(300, 2001, 100), normed=False)
While the count scale is perhaps more natural to interpret than the density scale, the chart becomes highly misleading when bins have different widths. Below, it appears (due to the count scale) that high-grossing movies are quite common, when in fact we have seen that they are relatively rare.

millions.hist('Gross', bins=[300, 400, 500, 600, 1500], normed=False)
Even though the method used is called hist, the figure above is NOT A HISTOGRAM. It gives the impression that movies are evenly distributed in the 300-600 range, but they are not. Moreover, it exaggerates the density of movies grossing more than 600 million. The height of each bar is simply the number of movies in the bin, without accounting for the difference in the widths of the bins. The picture becomes even more skewed if the last two bins are combined.
millions.hist('Gross', bins=[300, 400, 1500], normed=False)
In this count-based histogram, the shape of the distribution of movies is lost entirely.
So what is a histogram?

The figure above shows that what the eye perceives as "big" is area, not just height. This is particularly important when the bins are of different widths.

That is why a histogram has two defining properties:

1. The bins are contiguous (though some might be empty) and are drawn to scale.
2. The area of each bar is proportional to the number of entries in the bin.

Property 2 is the key to drawing a histogram, and is usually achieved as follows:

area of bar = proportion of entries in the bin

When drawn using this method, the histogram is said to be drawn on the density scale, and the total area of the bars is equal to 1.

To calculate the height of each bar, use the fact that the bar is a rectangle:

area of bar = height of bar × width of bin

and so

height of bar = area of bar / width of bin = proportion of entries in the bin / width of bin
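The rule for the height of a bar can be packaged as a one-line helper; this is our own sketch in plain Python, not a function from the textbook's library.

```python
def bar_height(count_in_bin, total_count, bin_width):
    """Height of a histogram bar on the density scale:
    the proportion of entries in the bin, divided by the bin's width."""
    return (count_in_bin / total_count) / bin_width

# The 300-400 bin: 81 of 200 movies, in a bin 100 million dollars wide.
print(bar_height(81, 200, 100))   # about 0.00405 per million dollars
```

Multiplying the height back by the bin width recovers the bar's area, which is the proportion of entries in the bin.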
The level of detail, and the flat tops of the bars

Even though the density scale correctly represents proportions using area, some detail is lost by placing different values into the same bins.

Take another look at the [300, 400) bin in the figure below. The flat top of the bar, at a level near 0.4%, hides the fact that the movies are somewhat unevenly distributed across the bin.

millions.hist('Gross', bins=[300, 400, 600, 1500])

To see this, let us split the [300, 400) bin into ten narrower bins of width 10 million dollars each:

millions.hist('Gross', bins=[300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 600, 1500])
Some of the skinny bars are taller than 0.4% and others are shorter. By putting a flat top at 0.4% over the whole bin, we are deciding to ignore the finer detail and use the flat level as a rough approximation. Often, though not always, this is sufficient for understanding the general shape of the distribution.

Notice that because we have the entire dataset, we can draw the histogram in as fine a level of detail as the data and our patience will allow. However, if you are looking at a histogram in a book or on a website, and you don't have access to the underlying dataset, then it becomes important to have a clear understanding of the "rough approximation" of the flat tops.
The density scale

The height of each bar is a proportion divided by a bin width. Thus, for this dataset, the values on the vertical axis are "proportions per million dollars." To understand this better, look again at the [300, 400) bin. The bin is 100 million dollars wide. So we can think of it as consisting of 100 narrow bins that are each 1 million dollars wide. The bar's height of roughly "0.004 per million dollars" means that in each of those 100 skinny bins of width 1 million dollars, the proportion of movies is roughly 0.004.

Thus the height of a histogram bar is a proportion per unit on the horizontal axis, and can be thought of as the density of entries per unit width.

millions.hist('Gross', bins=[300, 350, 400, 450, 1500])
Density Q&A

Look again at the histogram, and this time compare the [400, 450) bin with the [450, 1500) bin.

Q: Which has more movies in it?

A: The [450, 1500) bin. It has 92 movies, compared with 25 movies in the [400, 450) bin.

Q: Then why is the [450, 1500) bar so much shorter than the [400, 450) bar?

A: Because height represents density per unit width, not the number of movies in the bin. The [450, 1500) bin has more movies than the [400, 450) bin, but it is also a whole lot wider. So the density is much lower.

millions.bin('Gross', bins=[300, 350, 400, 450, 1500])
bin    Gross count
300 32
350 49
400 25
450 92
1500 0
Bar chart or histogram?

Bar charts display the distributions of categorical variables. All the bars in a bar chart have the same width. The lengths (or heights, if the bars are drawn vertically) of the bars are proportional to the number of entries.
Histograms display the distributions of quantitative variables. The bars can have different widths. The areas of the bars are proportional to the number of entries.
Multiple histograms

In all the examples in this section, we have drawn a single histogram. However, if a data table contains several columns, then barh and hist can be used to draw several charts at once. We will cover this feature in a later section.
Chapter 3

Sampling

Sampling is a powerful computational technique in data analysis. It allows us to investigate both issues of probability and statistics. Probability is the study of the outcomes of random processes; processes that include chance. Statistics is just the opposite: it is the study of random processes given observations of their outcomes. Put another way, probability focuses on what will be observed given what we know about the world. Statistics focuses on what we can infer about the world given what we have observed. Random sampling is a central tool in both directions of inference.
Sampling

In this section, we will continue to use the top_movies.csv dataset.

top = Table.read_table('top_movies.csv')
top.set_format([2, 3], NumberFormatter)
Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Avatar | Fox | 760,507,625 | 846,120,800 | 2009
Titanic | Paramount | 658,672,302 | 1,178,627,900 | 1997
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Marvel's The Avengers | Buena Vista (Disney) | 623,357,910 | 668,866,600 | 2012
The Dark Knight | Warner Bros. | 534,858,444 | 647,761,600 | 2008
Star Wars: Episode I - The Phantom Menace | Fox | 474,544,677 | 785,715,000 | 1999
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
Avengers: Age of Ultron | Buena Vista (Disney) | 459,005,868 | 465,684,200 | 2015
The Dark Knight Rises | Warner Bros. | 448,139,099 | 500,961,700 | 2012
... (190 rows omitted)
Sampling rows of a table

Deterministic Samples

When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.
A deterministic sample from the rows of a table can be constructed using the take method. Its argument is a sequence of integers, and it returns a table containing the corresponding rows of the original table.

The code below returns a table consisting of the rows indexed 3, 18, and 100 in the table top. Since the original table is sorted by rank, which begins at 1, the zero-indexed rows always have a rank that is one greater than the index. However, the take method does not inspect the information in the rows to construct its sample.

top.take([3, 18, 100])
Title | Studio | Gross | Gross (Adjusted) | Year
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Spider-Man | Sony | 403,706,375 | 604,517,300 | 2002
Gone with the Wind | MGM | 198,676,459 | 1,757,788,200 | 1939
We can select evenly spaced rows by calling np.arange and passing the result to take. In the example below, we start with the first row (index 0) of top, and choose every 40th row after that until we reach the end of the table.

top.take(np.arange(0, top.num_rows, 40))
Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Shrek the Third | Paramount/Dreamworks | 322,719,944 | 408,090,600 | 2007
Bruce Almighty | Universal | 242,829,261 | 350,350,700 | 2003
Three Men and a Baby | Buena Vista (Disney) | 167,780,960 | 362,822,900 | 1987
Saturday Night Fever | Paramount | 94,213,184 | 353,261,200 | 1977
Probability Samples

Much of data science consists of making conclusions based on the data in random samples. Correctly interpreting analyses based on random samples requires data scientists to examine exactly what random samples are.
A population is the set of all elements from whom a sample will be drawn.

A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.

In a probability sample, all elements need not have the same chance of being chosen. For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:

Person A is chosen with probability 1. One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.

This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:

A: 1
B: 1/2
C: 1/2
AB: 1/2
AC: 1/2
BC: 0
ABC: 0
Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
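These chances can be checked by simulation. The sketch below is our own plain-numpy code (not the textbook's table methods): it repeats the scheme many times and tallies which pair was drawn; the seed is fixed only to make the sketch reproducible.

```python
import numpy as np

np.random.seed(0)       # seed chosen only so this sketch is reproducible
trials = 100000

# Person A is always in the sample; a coin toss picks B (heads) or C (tails).
coin = np.random.randint(0, 2, trials)           # 0 represents heads, 1 tails
prop_ab = np.count_nonzero(coin == 0) / trials   # sample was {A, B}
prop_ac = np.count_nonzero(coin == 1) / trials   # sample was {A, C}

print(prop_ab, prop_ac)   # each close to 1/2; the pair {B, C} never occurs
```

The simulated proportions for {A, B} and {A, C} settle near 1/2 each, matching the table of chances, and {B, C} is impossible because A is always chosen.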
To draw a probability sample, we need to be able to choose elements according to a process that involves chance. A basic tool for this purpose is a random number generator. There are several in Python. Here we will use one that is part of the module random, which in turn is part of the module numpy.

The method randint, when given two arguments low and high, returns an integer picked uniformly at random between low and high, including low but excluding high. Run the code below several times to see the variability in the integers that are returned.
np.random.randint(3, 8)  # select once at random from 3, 4, 5, 6, 7
7
A Systematic Sample
Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.

Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that.

"""Choose a random start among rows 0 through 9;
then take every 10th row."""
start = np.random.randint(0, 10)
top.take(np.arange(start, top.num_rows, 10))
Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
The Hunger Games | Lionsgate | 408,010,692 | 442,510,400 | 2012
The Passion of the Christ | NM | 370,782,930 | 519,432,100 | 2004
Alice in Wonderland (2010) | Buena Vista (Disney) | 334,191,110 | 365,718,600 | 2010
Star Wars: Episode II - Attack of the Clones | Fox | 310,676,740 | 465,175,700 | 2002
The Sixth Sense | Buena Vista (Disney) | 293,506,292 | 500,938,400 | 1999
The Hangover | Warner Bros. | 277,322,503 | 323,343,300 | 2009
Raiders of the Lost Ark | Paramount | 248,159,971 | 770,183,000 | 1981
The Lost World: Jurassic Park | Universal | 229,086,679 | 434,216,600 | 1997
Austin Powers: The Spy Who Shagged Me | New Line | 206,040,086 | 352,863,900 | 1999
... (10 rows omitted)
Run the code a few times to see how the output varies.

This systematic sample is a probability sample. In this scheme, all rows do have the same chance of being chosen. But that is not true of other subsets of the rows. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen.
The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10.
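We can check that 1/10 chance with a short simulation of our own, written with numpy arrays rather than table methods. It repeats the scheme many times and records how often row 0 enters the sample; row 0 is included exactly when the random start is 0. The seed is fixed only for reproducibility.

```python
import numpy as np

np.random.seed(42)           # reproducibility for this sketch
n_rows, repetitions = 200, 10000

hits = 0
for _ in range(repetitions):
    start = np.random.randint(0, 10)          # random start among rows 0-9
    sample_rows = np.arange(start, n_rows, 10)
    if 0 in sample_rows:                      # row 0 is in the sample iff start == 0
        hits += 1

print(hits / repetitions)    # close to 1/10
```

The same tally for any other fixed row would also come out near 1/10, since each row belongs to exactly one of the ten possible samples.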
Random sample with replacement

Some of the simplest probability samples are formed by drawing repeatedly, uniformly at random, from the list of elements of the population. If the draws are made without changing the list between draws, the sample is called a random sample with replacement. You can imagine making the first draw at random, replacing the element drawn, and then drawing again.

In a random sample with replacement, each element in the population has the same chance of being drawn, and each can be drawn more than once in the sample.

If you want to draw a sample of people at random to get some information about a population, you might not want to sample with replacement – drawing the same person more than once can lead to a loss of information. But in data science, random samples with replacement arise in two major areas:

studying probabilities by simulating tosses of a coin, rolls of a die, or gambling games

creating new samples from a sample at hand

The second of these areas will be covered later in the course. For now, let us study some long-run properties of probabilities.

We will start with the table die which contains the numbers of spots on the faces of a die. All the numbers appear exactly once, as we are assuming that the die is fair.
die = Table().with_column('Face', [1, 2, 3, 4, 5, 6])
die
Face
1
2
3
4
5
6
Drawing the histogram of this simple set of numbers yields an unsettling figure as hist makes a default choice of bins:
die.hist()
The numbers 1, 2, 3, 4, 5, 6 are integers, so the bins chosen by hist only have entries at the edges. In such a situation, it is a better idea to select bins so that they are centered on the integers. This is often true of histograms of data that are discrete, that is, variables whose successive values are separated by the same fixed amount. The real advantage of this method of bin selection will become more clear when we start imagining smooth curves drawn over histograms.

dice_bins = np.arange(0.5, 7, 1)
die.hist(bins=dice_bins)
Notice how each bin has width 1 and is centered on an integer. Notice also that because the width of each bin is 1, the height of each bar is 1/6, the chance that the corresponding face appears.
The histogram shows the probability with which each face appears. It is called a probability histogram of the result of one roll of a die.

The histogram was drawn without rolling any dice or generating any random numbers. We will now use the computer to mimic actually rolling a die. The process of using a computer program to produce the results of a chance experiment is called simulation.

To roll the die, we will use a method called sample. This method returns a new table consisting of rows selected uniformly at random from a table. Its first argument is the number of rows to be returned. Its second argument is whether or not the sampling should be done with replacement.

The code below simulates 10 rolls of the die. As with all simulations of chance experiments, you should run the code several times and notice the variability in what is returned.

die.sample(10, with_replacement=True)
Face
5
4
2
3
6
5
1
1
6
2
We will now roll the die several times and draw the histogram of the observed results. The histogram of observed results is called an empirical histogram.

The results of rolling the die will all be integers in the range 1 through 6, so we will want to use the same bins as we used for the probability histogram. To avoid writing out the same bin argument every time we draw a histogram, let us define a function called dice_hist that will perform the task for us. The function will take one argument: the number n of dice to roll.
def dice_hist(n):
    rolls = die.sample(n, with_replacement=True)
    rolls.hist(bins=dice_bins)

dice_hist(20)
As we increase the number of rolls in the simulation, the proportions get closer to 1/6.

dice_hist(10000)
The behavior we have observed is an instance of a general rule.

The Law of Averages
If a chance experiment is repeated independently and under identical conditions, then, in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event.

For example, in the long run, the proportion of times the face with four spots appears gets closer and closer to 1/6.

Here "independently and under identical conditions" means that every repetition is performed in the same way regardless of the results of all the other repetitions.
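Here is a plain-numpy sketch of the law in action (our own code, separate from the dice_hist simulations above): simulate rolls of a fair die and watch the proportion of fours approach 1/6 as the number of rolls grows. The seed is fixed only to make the sketch reproducible.

```python
import numpy as np

np.random.seed(1)   # seed chosen only so this sketch is reproducible
for n in [100, 10000, 1000000]:
    rolls = np.random.randint(1, 7, n)            # n rolls of a fair die
    proportion = np.count_nonzero(rolls == 4) / n # proportion of fours
    print(n, proportion)                          # approaches 1/6 as n grows
```

With 100 rolls the proportion can wander noticeably; with a million rolls it is pinned very close to 1/6 ≈ 0.1667.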
Convergence of empirical histograms

We have also observed that a random quantity (such as the number of spots on one roll of a die) is associated with two histograms:

a probability histogram, that shows all the possible values of the quantity and all their chances

an empirical histogram, created by simulating the random quantity repeatedly and drawing a histogram of the observed results

We have seen an example of the long-run behavior of empirical histograms:

As the number of repetitions increases, the empirical histogram of a random quantity looks more and more like the probability histogram.
At the Roulette Table

Equipped with our new knowledge about the long-run behavior of chances, let us explore a gambling game. Betting on roulette is popular in gambling centers such as Las Vegas and Monte Carlo, and we will simulate one of the bets here.

The main randomizer in roulette in Nevada is a wheel that has 38 pockets on its rim. Two of the pockets are green, eighteen black, and eighteen red. The wheel is on a spindle, and there is a small ball on it. When the wheel is spun, the ball ricochets around and finally comes to rest in one of the pockets. That is declared to be the winning pocket.

You are allowed to bet on several pre-specified collections of pockets. If you bet on "red," you win if the ball comes to rest in one of the red pockets.

The bet is even money. That is, it pays 1 to 1. To understand what that means, assume you are going to bet $1 on "red." The first thing that happens, even before the wheel is spun, is that you have to hand over your $1. If the ball lands in a green or black pocket, you never see that dollar again. If the ball lands in a red pocket, you get your dollar back (to bring you back to even), plus another $1 in winnings.
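Before simulating, we can work out the long-run average of a $1 bet on red by hand; this is our own arithmetic sketch, not code from the text. With 18 of the 38 pockets red, a $1 bet wins $1 with chance 18/38 and loses $1 otherwise.

```python
# 18 of the 38 pockets are red; the other 20 lose the bet.
p_red = 18 / 38
expected_net = p_red * 1 + (1 - p_red) * (-1)

print(expected_net)        # about -0.0526: a roughly 5-cent loss per dollar bet
print(500 * expected_net)  # expected net over 500 bets: about -26 dollars
```

So in the long run a bettor loses about 5.3 cents per dollar bet, which is the house's edge in this game.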
The table wheel represents the pockets of a Nevada roulette wheel. It has 38 rows labeled 1 through 38, one row per pocket.

pockets = np.arange(1, 39)
colors = (['red', 'black']*5 + ['black', 'red']*4)*2 + ['green', 'green']
wheel = Table().with_columns([
    'pocket', pockets,
    'color', colors])
wheel
wheel
pocket color
1 red
2 black
3 red
4 black
5 red
6 black
7 red
8 black
9 red
10 black
... (28 rows omitted)
The function bet_on_red takes a numerical argument x and returns the net winnings on a $1 bet on "red," provided x is the number of a pocket.

def bet_on_red(x):
    """The net winnings of betting on red for outcome x."""
    pockets = wheel.where('pocket', x)
    if pockets.column('color').item(0) == 'red':
        return 1
    else:
        return -1
bet_on_red(17)
-1
The function spins takes a numerical argument n and returns a new table consisting of n rows of wheel sampled at random with replacement. In other words, it simulates the results of n spins of the roulette wheel.

def spins(n):
    return wheel.sample(n, with_replacement=True)
We will create a table called results consisting of the outcomes of 500 spins, and add a column that shows the net winnings on $1 placed on "red." Recall that the apply method applies a function to each element in a column of a table.

outcomes = spins(500)
results = outcomes.with_column(
    'winnings', outcomes.apply(bet_on_red, 'pocket'))
results
pocket color winnings
13 black -1
22 black -1
38 green -1
11 black -1
29 black -1
19 red 1
38 green -1
37 green -1
19 red 1
31 black -1
... (490 rows omitted)
And here is the net gain on all 500 bets:
sum(results.column('winnings'))
-32
Betting $1 on red hundreds of times seems like a bad idea from a gambler's perspective. But from the casinos' perspective it is excellent. Casinos rely on large numbers of bets being placed. The payoff odds are set so that the more bets that are placed, the more money the casinos are likely to make, even though a few people are likely to go home with winnings.
Simple Random Sample – a Random Sample without Replacement

A random sample without replacement is one in which elements are drawn from a list repeatedly, uniformly at random, at each stage deleting from the list the element that was drawn.

A random sample without replacement is also called a simple random sample. All elements of the population have the same chance of entering a simple random sample. All pairs have the same chance as each other, as do all triples, and so on.

The default action of sample is to draw without replacement. In card games, cards are almost always dealt without replacement. Let us use sample to deal cards from a deck.

A standard deck consists of 13 ranks of cards in each of four suits. The suits are called spades, clubs, diamonds, and hearts. The ranks are Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, and King. Spades and clubs are black; diamonds and hearts are red. The Jacks, Queens, and Kings are called 'face cards.'

The table deck contains all 52 cards in columns labeled rank and suit. The abbreviations are:

Ace: A, Jack: J, Queen: Q, King: K
Below, we use the product function to produce all possible pairs of a suit and a rank. The suits are represented by special characters used to print bridge and poker articles in newspapers.
from itertools import product
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['♠', '♥', '♦', '♣']
cards = product(ranks, suits)
deck = Table(['rank', 'suit']).with_rows(cards)
deck
rank suit
A ♠
A ♥
A ♦
A ♣
2 ♠
2 ♥
2 ♦
2 ♣
3 ♠
3 ♥
... (42 rows omitted)
A poker hand is 5 cards dealt at random from the deck. The expression below deals a hand. Execute the cell a few times; how often are you dealt a pair?
deck.sample(5)
rank suit
5 ♥
4 ♣
9 ♥
3 ♣
10 ♣
A hand is a set of five cards, regardless of the order in which they appeared. For example, the hand 9♣ 9♥ Q♣ 7♣ 8♥ is the same as the hand 7♣ 9♣ 8♥ 9♥ Q♣.

This fact can be used to show that simple random sampling can be thought of in two equivalent ways:

drawing elements one by one at random without replacement

randomly permuting (that is, shuffling) the whole list, and then pulling out a set of elements at the same time
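The two views can be sketched side by side with plain numpy; this is our own illustration, and the labels 0 through 51 are hypothetical stand-ins for the 52 cards. The seed is fixed only for reproducibility.

```python
import numpy as np

deck_ids = np.arange(52)         # label the 52 cards 0 through 51

np.random.seed(7)                # reproducibility for this sketch

# View 1: draw five cards one by one, at random without replacement.
hand1 = np.random.choice(deck_ids, size=5, replace=False)

# View 2: shuffle the whole deck, then take the top five cards at once.
shuffled = np.random.permutation(deck_ids)
hand2 = shuffled[:5]

# Either way, the result is a set of 5 distinct cards,
# and every such set is equally likely.
print(sorted(hand1), sorted(hand2))
```

Both procedures produce a uniformly random set of five distinct cards, which is exactly what a simple random sample of size 5 means.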
Iteration

Repetition

It is often the case when programming that you will wish to repeat the same operation multiple times, perhaps with slightly different behavior each time. You could copy-paste the code 10 times, but that's tedious and prone to typos, and if you wanted to do it a thousand times (or a million times), forget it.

A better solution is to use a for statement to loop over the contents of a sequence. A for statement begins with the word for, followed by a name for the item in the sequence, followed by the word in, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence.
for i in np.arange(5):
    print(i)
0
1
2
3
4
A typical use of a for statement is to build up a table by repeating a random computation many times and storing each result as a new row. The append method of a table takes a sequence and adds a new row. It's different from with_row because a new table is not created; instead, the original table is extended. The cell below draws 100 cards, but keeps only the aces.
aces = Table(['Rank', 'Suit'])
for i in np.arange(100):
    card = deck.row(np.random.randint(deck.num_rows))
    if card.item(0) == 'A':
        aces.append(card)
aces
Rank Suit
A ♠
A ♣
A ♥
A ♦
A ♠
A ♦
This pattern can be used to track the results of repeated experiments. For example, perhaps we want to learn about the empirical properties of some randomly drawn poker hands. Below, we track whether the hand contains four-of-a-kind or five cards of the same suit (called a flush).
hands = Table(['Four-of-a-kind', 'Flush'])
for i in np.arange(10000):
    hand = deck.sample(5)
    four_of_a_kind = max(hand.group('rank').column('count')) == 4
    flush = max(hand.group('suit').column('count')) == 5
    hands.append([four_of_a_kind, flush])
hands
Four-of-a-kind Flush
False False
False False
False False
False False
False False
False False
False False
False False
False False
False False
... (9990 rows omitted)
A for statement can also iterate over a sequence of labels. We can use this feature to summarize the results of our experiment. These are rare hands indeed!

for label in hands.labels:
    success = np.count_nonzero(hands.column(label))
    print('A', label, 'was drawn', success, 'of', hands.num_rows, 'times')
A Four-of-a-kind was drawn 2 of 10000 times
A Flush was drawn 22 of 10000 times
Randomized response

Next, we'll look at a technique that was designed several decades ago to help conduct surveys of sensitive subjects. Researchers wanted to ask participants a few questions: Have you ever had an affair? Do you secretly think you are gay? Have you ever shoplifted? Have you ever sung a Justin Bieber song in the shower? They figured that some people might not respond honestly, because of the social stigma associated with answering "yes". So, they came up with a clever way to estimate the fraction of the population who are in the "yes" camp, without violating anyone's privacy.
Here's the idea. We'll instruct the respondent to roll a fair 6-sided die, secretly, where no one else can see it. If the die comes up 1, 2, 3, or 4, then the respondent is supposed to answer honestly. If it comes up 5 or 6, the respondent is supposed to answer the opposite of what their true answer would be. But, they shouldn't reveal what came up on their die.

Notice how clever this is. Even if the person says "yes", that doesn't necessarily mean their true answer is "yes" -- they might very well have just rolled a 5 or 6. So the responses to the survey don't reveal any one individual's true answer. Yet, in aggregate, the responses give enough information that we can get a pretty good estimate of the fraction of people whose true answer is "yes".

Let's try a simulation, so we can see how this works. We'll write some code to perform this operation. First, a function to simulate rolling one die:
def roll_once():
    return np.random.randint(1, 7)
Now we'll use this to write a function to simulate how someone is supposed to respond to the survey. The argument to the function is their true answer (True or False); the function returns what they're supposed to tell the interviewer.

def respond(true_answer):
    if roll_once() >= 5:
        return not true_answer
    else:
        return true_answer
We can try it. Assume our true answer is 'no'; let's see what happens this time:
respond(False)
False
Of course, if you were to run it many times, you might get a different result each time. Below, we build a table of the responses for many repetitions when the true answer is always False.
responses = Table(['Truth', 'Response'])
for i in np.arange(1000):
    responses.append([False, respond(False)])
responses
Truth Response
False False
False False
False False
False False
False False
False False
False True
False False
False True
False False
... (990 rows omitted)
Let's build a bar chart and look at how many True and False responses we get.

responses.group('Response').barh('Response')

responses.where('Response', False).num_rows
656
responses.where('Response', True).num_rows
344
Exercise for you: If N out of 1000 responses are True, approximately what fraction of the population has truly sung a Justin Bieber song in the shower?
Analysis

This method is called "randomized response". It is one way to poll people about sensitive subjects, while still protecting their privacy. You can see how it is a nice example of randomness at work.

It turns out that randomized response has beautiful generalizations. For instance, your Chrome web browser uses it to anonymously report feedback to Google, in a way that won't violate your privacy. That's all we'll say about it for this semester, but if you take an upper-division course, maybe you'll get to see some generalizations of this beautiful technique.

The steps in the randomized response survey can be visualized using a tree diagram. The diagram partitions all the survey respondents according to their true answer and the answer that they eventually give. It also displays the proportions of respondents whose true answers are 1 ("True") and 0 ("False"), as well as the chances that determine the answers that they give. We have used p to denote the proportion whose true answer is 1.
The respondents who answer 1 split into two groups. The first group consists of the respondents whose true answer and given answer are both 1. If the number of respondents is large, the proportion in this group is likely to be about 2/3 of p. The second group consists of the respondents whose true answer is 0 and given answer is 1. The proportion in this group is likely to be about 1/3 of 1-p.

We can observe q, the proportion of 1's among the given answers. Thus

q ≈ (2/3)p + (1/3)(1 − p) = 1/3 + (1/3)p

This allows us to solve for an approximate value of p:

p ≈ 3q − 1

In this way we can use the observed proportion of 1's to "work backwards" and get an estimate of p, the proportion in which we are interested.
Technical note. It is worth noting the conditions under which this estimate is valid. The calculation of the proportions in the two groups whose given answer is 1 relies on each of the groups being large enough so that the Law of Averages allows us to make estimates about how their dice are going to land. This means that it is not only the total number of respondents that has to be large – the number of respondents whose true answer is 1 has to be large, as does the number whose true answer is 0. For this to happen, p must be neither close to 0 nor close to 1. If the characteristic of interest is either extremely rare or extremely common in the population, the method of randomized response described in this example might not work well.
Let's try out this method on some real data. The chance of drawing a poker hand with no aces is

np.product(np.arange(48, 43, -1)/np.arange(52, 47, -1))
0.65884199833779666
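The same chance can be computed by counting hands: of the C(52, 5) possible five-card hands, C(48, 5) contain no aces. A quick check with the standard library:

```python
from math import comb

# Hands with no aces are chosen entirely from the 48 non-ace cards.
p_no_aces = comb(48, 5) / comb(52, 5)
print(p_no_aces)   # 0.6588..., matching the product above
```

The product of fractions in the cell above is just this same ratio written one draw at a time: 48/52 × 47/51 × 46/50 × 45/49 × 44/48.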
It is quite embarrassing indeed to draw a hand with no aces. The table below contains one column for the truth of whether a hand has no aces and another for the randomized response.
ace_responses = Table(['Truth', 'Response'])
for i in np.arange(10000):
    hand = deck.sample(5)
    no_aces = hand.where('rank', 'A').num_rows == 0
    ace_responses.append([no_aces, respond(no_aces)])
ace_responses
Truth Response
False True
True True
False True
True False
True False
True True
False False
True False
False False
True True
... (9990 rows omitted)
Using our derived formula, we can estimate what fraction of hands have no aces.

3*np.count_nonzero(ace_responses.column('Response'))/10000 - 1
0.6644000000000001
Estimation

In the previous section we saw that two kinds of histograms can be associated with a quantity generated by a random process.
The probability histogram shows all the possible values of the quantity and the probabilities of all those values. The probabilities are calculated using math and the rules of chance.

The empirical histogram is based on running the random process repeatedly and keeping track of the value of the quantity each time. It shows all the values of the quantity and the proportion of times each value was observed among all the repetitions.

We noted that by the Law of Averages, the empirical histogram is likely to resemble the probability histogram if the random process is repeated a large number of times.

This means that simulating random processes repeatedly is a way of approximating probability distributions without figuring out the probabilities mathematically. Thus computer simulations become a powerful tool in data science. They can help data scientists understand the properties of random quantities that would be complicated to analyze using math alone.

Here is an example of such a simulation.
Estimating the number of enemy aircraft

In World War II, data analysts working for the Allies were tasked with estimating the number of German warplanes. The data included the serial numbers of the German planes that had been observed by Allied forces. These serial numbers gave the data analysts a way to come up with an answer.

To create an estimate of the total number of warplanes, the data analysts had to make some assumptions about the serial numbers. Here are two such assumptions, greatly simplified to make our calculations easier.

1. There are N planes, numbered 1, 2, ..., N.
2. The observed planes are drawn uniformly at random with replacement from the N planes.

The goal is to estimate the number N.
Suppose you observe some planes and note down their serial numbers. How might you use the data to guess the value of N? A natural and straightforward method would be to simply use the largest serial number observed.

Let us see how well this method of estimation works. First, another simplification: some historians now estimate that the German aircraft industry produced almost 100,000 warplanes of many different kinds. But here we will imagine just one kind. That makes Assumption 1 above easier to justify.

Suppose there are in fact N = 300 planes of this kind, and that you observe 30 of them. We can construct a table called serialno that contains the serial numbers 1 through N. We can then sample 30 times with replacement (see Assumption 2) to get our sample of serial numbers. Our estimate is the maximum of these 30 numbers.
N = 300
serialno = Table().with_column('serialnumber', np.arange(1, N+1))
serialno
serialnumber
1
2
3
4
5
6
7
8
9
10
... (290 rows omitted)
max(serialno.sample(30, with_replacement=True).column(0))
296
As with all code involving random sampling, run it a few times to see the variation. You will observe that even with just 30 observations from among 300, the largest serial number is typically in the 250-300 range.

In principle, the largest serial number could be as small as 1, if you were unlucky enough to see Plane Number 1 all 30 times. And it could be as large as 300 if you observe Plane Number 300 at least once. But usually, it seems to be in the very high 200's. It appears that if you use the largest observed serial number as your estimate of the total, you will not be very far wrong.

Let us generate some data to see if we can confirm this. We will use iteration to repeat the sampling procedure numerous times, each time noting the largest serial number observed. These would be our estimates of N from all the numerous samples. We will then draw a histogram of all these estimates, and examine by how much they differ from N.

In the code below, we will run 750 repetitions of the following process: sample 30 times at random with replacement from 1 through 300 and note the largest number observed.

To do this, we will use a for loop. As you have seen before, we will start by setting up an empty table that will eventually hold all the estimates that are generated. As each estimate is the largest number in its sample, we will call this table maxes.

For each integer (called i in the code) in the range 0 through 749 (750 total), the for loop executes the code in the body of the loop. In this example, it generates a random sample of 30 serial numbers, computes the maximum value, and augments the rows of maxes with that value.
sample_size = 30
repetitions = 750
maxes = Table(['i', 'max'])
for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    maxes.append([i, max(sample.column(0))])
maxes
i max
0 300
1 300
2 291
3 300
4 297
5 295
6 283
7 296
8 299
9 273
... (740 rows omitted)
Here is a histogram of the 750 estimates. As you can see, the estimates are all crowded up near 300, even though in theory they could be much smaller. The histogram indicates that as an estimate of the total number of planes, the largest serial number might be too low by about 10 to 25. But it is extremely unlikely to underestimate the true number of planes by more than about 50.
every_ten = np.arange(1, N+100, 10)
maxes.hist('max', bins=every_ten)
The empirical distribution of a statistic¶
In the example above, the largest serial number is called a statistic (singular!). A statistic is any number computed using the data in a sample.

The graph above is an empirical histogram. It displays the empirical or observed distribution of the statistic, based on 750 samples.

The statistic could have been different.

A fundamental consideration in using any statistic based on a random sample is that the sample could have come out differently, and therefore the statistic could have come out differently too.

Just how different could the statistic have been?

If you generate all of the possible samples, and compute the statistic for each of them, then you will have a clear picture of how different the statistic might have been. Indeed, you will have a complete enumeration of all the possible values of the statistic and all their probabilities. The resulting distribution is called the probability distribution of the statistic, and its histogram is the same as the probability histogram.

The probability distribution of a statistic is also called the sampling distribution of the statistic, because it is based on all of the possible samples.
The total number of possible samples is often very large. For example, the total number of possible sequences of 30 serial numbers that you could see if there were 300 aircraft is
300**30
205891132094649000000000000000000000000000000000000000000000000000000000000
That's a lot of samples. Fortunately, we don't have to generate all of them. We know that the empirical histogram of the statistic, based on many but not all of the possible samples, is a good approximation to the probability histogram. The empirical distribution of a statistic gives a good idea of how different the statistic could be.

It is true that the probability distribution of a statistic contains more accurate information about the statistic than an empirical distribution does. But often, as in this example, the approximation provided by the empirical distribution is sufficient for data scientists to understand how much a statistic can vary. And empirical distributions are easier to compute. Therefore, data scientists often use empirical distributions instead of exact probability distributions when they are trying to understand random quantities.

Another Estimate.
Here is an example to illustrate this point. Thus far, we have used the largest observed serial number as an estimate of the total number of planes. But there are other possible estimates, and we will now consider one of them.

The idea underlying this estimate is that the average of the observed serial numbers is likely to be about halfway between 1 and N. Thus, if A is the average, then A is roughly N/2, and so N is roughly 2A.

Thus a new statistic can be used to estimate the total number of planes: take the average of the observed serial numbers and double it.

How does this method of estimation compare with using the largest number observed? It is not easy to calculate the probability distribution of the new statistic. But as before, we can simulate it to get the probabilities approximately. Let's take a look at the empirical distribution based on repeated sampling. The number of repetitions is chosen to be the same as it was for the earlier estimate. This will allow us to compare the two empirical distributions.
new_est = Table(['i', '2*avg'])
for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, 2 * np.average(sample.column(0))])
new_est.hist('2*avg', bins=every_ten)
Notice that unlike the earlier estimate, this one can overestimate the number of planes. This will happen when the average of the observed serial numbers is closer to N than to 1.
For ease of comparison, let us run another set of 750 repetitions of the sampling process and calculate both estimates for each sample.
new_est = Table(['i', 'max', '2*avg'])
for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])
new_est.select(['max', '2*avg']).hist(bins=every_ten)
You can see that the old method almost always underestimates; formally, we say that it is biased. But it has low variability, and is highly likely to be close to the true total number of planes.

The new method overestimates about as often as it underestimates, and thus is roughly unbiased on average in the long run. However, it is more variable than the old estimate, and thus is prone to larger absolute errors.

This is an instance of a bias-variance tradeoff that is not uncommon among competing estimates. Which estimate you decide to use will depend on the kinds of errors that matter the most to you. In the case of enemy warplanes, underestimating the total number might have grim consequences, in which case you might choose to use the more variable method that overestimates about half the time. On the other hand, if overestimation leads to needlessly high costs of guarding against planes that don't exist, you might be satisfied with the method that underestimates by a modest amount.
Technical note: In fact, twice the average is not unbiased. On average, it overestimates by exactly 1. For example, if N is 3, the average of draws from 1, 2, and 3 will be 2, and 2 times 2 is 4, which is one more than N. Twice the average minus 1 is an unbiased estimator of N.
Center

We have seen that studying the variability of an estimate involves questions such as:

Where is the empirical histogram centered? About how far away from center could the values be?

In order to discuss estimates and their variability in greater detail, it will help to quantify some basic features of distributions.

The mean of a list of numbers¶

In this course, we will use the words "average" and "mean" interchangeably.

Definition: The average or mean of a list of numbers is the sum of all the numbers in the list, divided by the number of entries in the list.

Here is the definition applied to a list of 20 numbers.
# The data: a list of numbers
x_list = [4, 4, 4, 5, 5, 5, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
# The average
sum(x_list) / len(x_list)
3.15
The average of the list is 3.15.

You can think of taking the mean as an "equalizing" or "smoothing" operation. For example, imagine the entries in x_list being the amounts of money (in dollars) in the pockets of 20 different people. To get the mean, you first put all of the money into one big pot, and then divide it evenly among all 20 people. They had started out with different amounts of money in their pockets, but now each person has $3.15, the average amount.
Computation: We could instead have found the mean by applying the function np.mean to the list x_list. Differences in rounding lead to answers that differ very slightly.
np.mean(x_list)
3.1499999999999999
Or we could have placed the list in a table, and used the Table method mean:

x_table = Table().with_column('value', x_list)
x_table['value'].mean()
3.1499999999999999
The mean and the histogram¶

Another way to calculate the mean is to first arrange the list in a distribution table. The columns contain the distinct values in the list, the count of each value, and the corresponding proportion. The sum of the count column is 20, the number of entries in the list.
values = Table(['value', 'count']).with_rows([
    [2, 6],
    [3, 8],
    [4, 3],
    [5, 3]
])
counts = values.column('count')
x_dist = values.with_column('proportion', counts / sum(counts))
# the distribution table
x_dist
value count proportion
2 6 0.3
3 8 0.4
4 3 0.15
5 3 0.15
The sum of the numbers in the list can be calculated by multiplying the first two columns and adding up (six 2's, eight 3's, and so on), and then dividing by 20:

sum(x_dist.column('value') * x_dist.column('count')) / sum(x_dist.column('count'))
3.1499999999999999
This is the same as multiplying the first and third columns (the values and their proportions) and adding up:

sum(x_dist.column('value') * x_dist.column('proportion'))
3.1500000000000004
We now have another way to think about the mean: The mean of a list is the weighted average of the distinct values, where the weights are the proportions in which those values appear.

Physically, this implies that the average is the balance point of the histogram. Here is the histogram of the distribution. Imagine that it is made out of cardboard and that you are trying to balance the cardboard figure at a point on the horizontal axis. The average of 3.15 is where the figure will balance.

To understand why that is, you will have to study some physics. Balance points, or centers of gravity, can be calculated in exactly the way we calculated the mean from the distribution table.

x_dist.select(['value', 'proportion']).hist(counts='value', bins=np.arange(0.5, 10.6, 1))
Notice that the mean of a list depends only on the distinct values and their proportions; you do not need to know how many entries there are in the list.

In other words, the average is a property of the histogram. If two lists have the same histogram, they will also have the same average.
The mean and the median¶
If a student's score on a test is below average, does that imply that the student is in the bottom half of the class on that test?

Happily for the student, the answer is, "Not necessarily." The reason has to do with the relation between the average, which is the balance point of the histogram, and the median, which is the "half-way point" of the data.

Let's compare the balance point of the histogram to the point that has 50% of the area of the histogram on either side of it.

The relationship is easy to see in a simple example. Here is the list 1, 2, 2, 3, represented as a distribution table called sym for "symmetric".

sym = Table().with_columns([
    'value', [1, 2, 3],
    'dist', [0.25, 0.5, 0.25]
])
sym
value dist
1 0.25
2 0.5
3 0.25
The histogram (or a calculation) shows that the average and median are both 2.

sym.hist(counts='value', bins=np.arange(0.5, 10.6, 1))
In general, for symmetric distributions, the mean and the median are equal.

What if the distribution is not symmetric? We will explore this by making a change in the histogram above: we will take the bar at the value 3 and slide it over to the value 10.

# Average versus median:
# Balance point versus 50% point
slide = Table().with_columns([
    'value', [1, 2, 3, 10],
    'dist1', [0.25, 0.5, 0.25, 0],
    'dist2', [0.25, 0.5, 0, 0.25]
])
slide
value dist1 dist2
1 0.25 0.25
2 0.5 0.5
3 0.25 0
10 0 0.25
slide.hist(counts='value', bins=np.arange(0.5, 10.6, 1))

The blue histogram represents the original distribution; the gold histogram starts out the same as the blue at the left end, but its rightmost bar has slid over to the value 10. The brown part is where the two histograms overlap.

The median and mean of the blue distribution are both equal to 2. The median of the gold distribution is also equal to 2; the area on either side of 2 is still 50%, though the right half is distributed differently from the left.

But the mean of the gold distribution is not 2: the gold histogram would not balance at 2. The balance point has shifted to the right, to 3.75.

sum(slide['value'] * slide['dist2'])

3.75

In the gold distribution, 3 out of 4 entries (75%) are below average. The student with a below average score can therefore take heart. He or she might be in the majority of the class.
The mean, the median, and skewed distributions: In general, if the histogram has a tail on one side (the formal term is "skewed"), then the mean is pulled away from the median in the direction of the tail.

Example: Flight Delays. The Bureau of Transportation Statistics provides data on departures and arrivals of flights in the United States. The table ua contains the departure delay times, in minutes, of about 37,500 United Airlines flights over the past few years. A negative delay means that the flight left earlier than scheduled.

ontime = Table.read_table('airline_ontime.csv')
ontime.show(3)
Carrier DepartureDelay ArrivalDelay
AA -1 -1
AA -1 -1
AA -1 -1
... (457010 rows omitted)

ua = ontime.where('Carrier', 'UA').select(1)
ua.show(3)
DepartureDelay
-1
-1
-1
... (37360 rows omitted)

ua_bins = np.arange(-15, 121, 2)
ua.hist(bins=ua_bins, unit='minute')
The histogram of the delay times is right-skewed: it has a long right hand tail.

The mean gets pulled away from the median in the direction of the tail. So we expect the mean delay to be larger than the median, and that is indeed the case:
np.mean(ua['DepartureDelay'])
13.885555228434548
np.median(ua['DepartureDelay'])
2.0
Spread

Variability¶

The Rough Size of Deviations from Average¶

As we have seen, inference based on test statistics must take into account the way in which the statistics vary across samples. It is therefore important to be able to quantify the variability in any list of numbers. One way is to create a measure of the difference between the values in the list and the average of the list.

We will start by defining such a measure in the context of a simple array of just four numbers.

numbers = np.array([1, 2, 2, 10])
numbers

array([1, 2, 2, 10])

The goal is to get a measure of roughly how far off the numbers in the list are from the average. To do this, we need the average, and all the deviations from the average. A "deviation from average" is just a value minus the average.
# Step 1. The average.
average = np.mean(numbers)
average
3.75
# Step 2. The deviations from average.
deviations = numbers - average
numbers_and_deviations = Table().with_columns([
    'value', numbers,
    'deviationfromaverage', deviations
])
numbers_and_deviations
value deviationfromaverage
1 -2.75
2 -1.75
2 -1.75
10 6.25
Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.

To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:
sum(deviations)
0.0
The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: the sum of the deviations from average is zero.
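Why this is always true can be seen in one line from the definition of the mean. Writing $\bar{x}$ for the average of the $n$ numbers $x_1, \ldots, x_n$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x}) \;=\; \sum_{i=1}^{n} x_i \;-\; n\bar{x} \;=\; n\bar{x} - n\bar{x} \;=\; 0
```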
Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:
np.mean(deviations)
0.0
Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.

There are two time-honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.
So let us eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:
# Step 3. The squared deviations from average
squared_deviations = deviations ** 2
numbers_and_deviations.with_column('squareddeviationsfromaverage', squared_deviations)
value deviationfromaverage squareddeviationsfromaverage
1 -2.75 7.5625
2 -1.75 3.0625
2 -1.75 3.0625
10 6.25 39.0625
# Variance = the mean squared deviation from average
variance = np.mean(squared_deviations)
variance
13.1875
Variance: The mean squared deviation is called the variance of a sequence of values. While it does give us an idea of spread, it is not on the same scale as the original variable and its units are the square of the original. This makes interpretation very difficult. So we return to the original scale by taking the positive square root of the variance:
# Standard Deviation: root mean squared deviation from average
# Steps of calculation: 5 4 3 2 1
sd = variance ** 0.5
sd
3.6314597615834874
Standard Deviation¶

The quantity that we have just computed is called the standard deviation of the list, and is abbreviated as SD. It measures roughly how far the numbers on the list are from their average.

Definition. The SD of a list is defined as the root mean square of deviations from average. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.
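Reading the definition from right to left can be spelled out as a function (a sketch for illustration; the function name is ours, not the text's):

```python
import numpy as np

def root_mean_square_of_deviations(values):
    """The SD, computed step by step from its definition, read right to left."""
    average = np.mean(values)        # ... from average
    deviations = values - average    # ... deviations ...
    squares = deviations ** 2        # ... square of ...
    mean_square = np.mean(squares)   # ... mean ...
    return mean_square ** 0.5        # root ...

numbers = np.array([1, 2, 2, 10])
print(root_mean_square_of_deviations(numbers))  # same value as np.std(numbers)
```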
Computation. The NumPy function np.std computes the SD of values in an array:
np.std(numbers)
3.6314597615834874
Working with the SD¶

Let us now examine the SD in the context of a more interesting dataset. The table nba contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, weight in pounds, and age in years.

nba = Table.read_table("nba2013.csv")
nba
Name Position Height Weight Agein2013
DeQuan Jones Guard 80 221 23
Darius Miller Guard 80 235 23
Trevor Ariza Guard 80 210 28
James Jones Guard 80 215 32
Wesley Johnson Guard 79 215 26
Klay Thompson Guard 79 205 23
Thabo Sefolosha Guard 79 215 29
Chase Budinger Guard 79 218 25
Kevin Martin Guard 79 185 30
Evan Fournier Guard 79 206 20
... (495 rows omitted)
Here is a histogram of the players' heights.

nba.select('Height').hist(bins=np.arange(68, 88, 1))

It is no surprise that NBA players are tall! Their average height is just over 79 inches (6'7"), about 10 inches taller than the average height of men in the United States.

mean_height = np.mean(nba.column('Height'))
mean_height
79.065346534653472
About how far off are the players' heights from the average? This is measured by the SD of the heights, which is about 3.45 inches.

sd_height = np.std(nba.column('Height'))
sd_height
3.4505971830275546
The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches.

nba.sort('Height', descending=True).show(3)
Name Position Height Weight Agein2013
Hasheem Thabeet Center 87 263 26
Roy Hibbert Center 86 278 26
Tyson Chandler Center 85 235 30
... (502 rows omitted)
Thabeet was about 8 inches above the average height.

87 - mean_height
7.9346534653465284
That's a deviation from average, and it is about 2.3 times the standard deviation:

(87 - mean_height) / sd_height
2.2995015194397923
In other words, the height of the tallest player was about 2.3 SDs above average.
At 69 inches tall, Isaiah Thomas was one of the two shortest players in the NBA. His height was about 2.9 SDs below average.

nba.sort('Height').show(3)
Name Position Height Weight Agein2013
Isaiah Thomas Guard 69 185 24
Nate Robinson Guard 69 180 29
John Lucas III Guard 71 157 30
... (502 rows omitted)
(69 - mean_height) / sd_height
-2.9169868288775844
What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about the data.

First main reason for measuring spread by the SD¶

Informal statement. For all lists, the bulk of the entries are within the range "average ± a few SDs".

Try to resist the desire to know exactly what fuzzy words like "bulk" and "few" mean. We will make them precise later in this section. For now, let us examine the statement in the context of some examples.

We have already seen that all of the heights of the NBA players were in the range "average ± 3 SDs".

What about the ages? Here is a histogram of the distribution. Yes, there was a 15-year-old player in the NBA in 2013.

nba.select('Agein2013').hist(bins=np.arange(15, 45, 1))
Juwan Howard was the oldest player, at 40. His age was about 3.2 SDs above average.

nba.sort('Agein2013', descending=True).show(3)
Name Position Height Weight Agein2013
Juwan Howard Forward 81 250 40
Marcus Camby Center 83 235 39
Derek Fisher Guard 73 210 39
... (502 rows omitted)
ages = nba['Agein2013']
(40 - np.mean(ages)) / np.std(ages)
3.1958482778922357
The youngest was 15-year-old Jarvis Varnado, who won the NBA Championship that year with the Miami Heat. His age was about 2.6 SDs below average.

nba.sort('Agein2013').show(3)
Name Position Height Weight Agein2013
Jarvis Varnado Forward 81 230 15
Giannis Antetokounmpo Forward 81 205 18
Sergey Karasev Guard 79 197 19
... (502 rows omitted)
(15 - np.mean(ages)) / np.std(ages)
-2.5895811038670811
What we have observed for the heights and ages is true in great generality. For all lists, the bulk of the entries are no more than 2 or 3 SDs away from the average.

These rough statements about deviations from average are made precise by the following result.
Chebychev's inequality¶

For all lists, and all positive numbers z, the proportion of entries that are in the range "average ± z SDs" is at least 1 - 1/z².

What makes this result powerful is that it is true for all lists, that is, for all distributions, no matter how irregular.

Specifically, Chebychev's inequality says that for every list:

the proportion in the range "average ± 2 SDs" is at least 1 - 1/4 = 0.75
the proportion in the range "average ± 3 SDs" is at least 1 - 1/9, which is about 0.89
the proportion in the range "average ± 4.5 SDs" is at least 1 - 1/4.5², which is about 0.95

Chebychev's Inequality gives a lower bound, not an exact answer or an approximation. For example, the percent of entries in the range "average ± 2 SDs" might be quite a bit larger than 75%. But it cannot be smaller.
Standard units¶

In the calculations above, z measures standard units, the number of standard deviations above average.
Some values of standard units are negative, corresponding to original values that are below average. Other values of standard units are positive. But no matter what the distribution of the list looks like, Chebychev's Inequality says that standard units will typically be in the (-5, 5) range.

To convert a value to standard units, first find how far it is from average, and then compare that deviation with the standard deviation.

As we will see, standard units are frequently used in data analysis. So it is useful to define a function that converts an array of numbers to standard units.
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)
As we saw in an earlier section, the table ua contains a column DepartureDelay consisting of the departure delay times, in minutes, of over 37,000 United Airlines flights. We can create a new column called z by applying the function standard_units to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units.

ua.append_column('z', standard_units(ua.column('DepartureDelay')))
ua
DepartureDelay z
-1 -0.391942
-1 -0.391942
-1 -0.391942
-1 -0.391942
-1 -0.391942
-1 -0.391942
-1 -0.391942
-1 -0.391942
-1 -0.391942
-1 -0.391942
... (37353 rows omitted)

Something rather alarming happens when we sort the delay times from highest to lowest. The standard units are extremely high!

ua.sort(0, descending=True)
DepartureDelay z
886 22.9631
739 19.0925
678 17.4864
595 15.301
566 14.5374
558 14.3267
543 13.9318
541 13.8791
537 13.7738
513 13.1419
... (37353 rows omitted)

What this shows is that it is possible for data to be many SDs above average (and for flights to be delayed by almost 15 hours). The highest value is more than 22 in standard units.
However, the proportion of these extreme values is tiny, and the bounds in Chebychev's inequality still hold true. For example, let us calculate the percent of delay times that are in the range "average ± 2 SDs". This is the same as the percent of times for which the standard units are in the range (-2, 2). That is just about 96%, as computed below, consistent with Chebychev's bound of "at least 75%".

within_2_sd = np.logical_and(ua.column('z') > -2, ua.column('z') < 2)
ua.where(within_2_sd).num_rows / ua.num_rows
0.960522441988063
The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right hand tail continues all the way out to 22 standard units (886 minutes). The area of the histogram outside the range -2 to 2 is about 4%, put together in tiny little bits that are mostly invisible in a histogram.

ua.hist('z', bins=np.arange(-5, 23, 1), unit='standard unit')
The Normal Distribution

We know that the mean is the balance point of the histogram. Unlike the mean, the SD is usually not easy to identify by looking at the histogram.

However, there is one shape of distribution for which the SD is almost as clearly identifiable as the mean. This section examines that shape, as it appears frequently in probability histograms and also in many histograms of data.

The SD and the bell curve¶

The table births contains data on over 1,000 babies born at a hospital in the Bay Area. The columns contain several variables: the baby's birth weight in ounces; the number of gestational days; the mother's age in years, height in inches, and pregnancy weight in pounds; and a record of whether she smoked or not.

births = Table.read_table('baby.csv')
births
BirthWeight GestationalDays MaternalAge MaternalHeight MaternalPregnancyWeight MaternalSmoker
120 284 27 62 100 False
113 282 33 64 135 False
128 279 28 64 115 True
108 282 23 67 125 True
136 286 25 62 93 False
138 244 33 62 178 False
132 245 23 65 140 False
120 289 25 62 125 False
143 299 30 66 136 True
140 351 27 68 120 False
... (1164 rows omitted)
The mothers' heights have a mean of 64 inches and an SD of 2.5 inches. Unlike the heights of the basketball players, the mothers' heights are distributed fairly symmetrically about the mean in a bell-shaped curve.

heights = births.column('MaternalHeight')
mean_height = np.round(np.mean(heights), 1)
mean_height

64.0

sd_height = np.round(np.std(heights), 1)
sd_height
2.5
births.hist('MaternalHeight', bins=np.arange(55.5, 72.5, 1), unit='inch')
positions = np.arange(-3, 3.1, 1) * sd_height + mean_height
_ = plots.xticks(positions)

The last two lines of code in the cell above change the labeling of the horizontal axis. Now, the labels correspond to "average ± z SDs" for z = 0, 1, 2, and 3. Because of the shape of the distribution, the "center" has an unambiguous meaning and is clearly visible at 64.
To see how the SD is related to the curve, start at the top of the curve and look towards the right. Notice that there is a place where the curve changes from looking like an "upside-down cup" to a "right-way-up cup"; formally, the curve has a point of inflection. That point is one SD above average. It is the point "average plus 1 SD" = 66.5 inches.

Symmetrically on the left-hand side of the mean, the point of inflection is at "average minus 1 SD" = 61.5 inches.

How to spot the SD on a bell-shaped curve: For bell-shaped distributions, the SD is the distance between the mean and the points of inflection on either side.
Bell-shaped probability histograms¶

The reason for studying bell-shaped curves is not just that they appear as some distributions of data, but that they appear as approximations to probability histograms of sample averages.

We have seen this phenomenon already, in the example about estimating the total number of aircraft. Recall that we simulated random samples of size 30 drawn with replacement from the serial numbers 1, 2, 3, ..., 300, and computed two different quantities: the largest of the 30 numbers observed, and twice the average of the 30 numbers. Here are the generated histograms.
N = 300
sample_size = 30
repetitions = 5000
serialno = Table().with_column('serialnumber', np.arange(1, N+1))
new_est = Table(['i', 'max', '2*avg'])
for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])
every_ten = np.arange(1, N+100, 10)
new_est.select(['max', '2*avg']).hist(bins=every_ten)
We observed that the distribution of "twice the sample average" is roughly symmetric; now let's also observe that it is roughly bell-shaped.

If the distribution of "twice the average" is bell-shaped, so is the distribution of the average. Here is the distribution of the average of a sample of size 30 drawn from the numbers 1, 2, 3, ..., 300. The number of replications of the sampling procedure is large, so it is safe to assume that the empirical histogram resembles the probability histogram of the sample average.
sample_means = Table(['Samplemean'])
for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])
every_five = np.arange(100.5, 200.6, 5)
sample_means.hist(bins=every_five)
We can redraw this histogram with a red bell-shaped curve superposed. This curve is generated from the mean and standard deviation of the sample means. Don't worry about the code that draws the curve; just focus on the diagram itself.

# Import a module of standard statistical functions
from scipy import stats
sample_means.hist(bins=every_five)
means = sample_means['Samplemean']
m = np.mean(means)
s = np.std(means)
x = np.arange(100.5, 200.6, 1)
plots.plot(x, stats.norm.pdf(x, m, s), color='r', lw=1)
positions = np.arange(-3, 3.1, 1) * 15 + 150
_ = plots.xticks(positions, [int(y) for y in positions])
Since all the numbers drawn are between 1 and 300, we expect the sample mean to be around 150, which is where the histogram is centered.

Can you guess what the SD is? Run your eye along the curve till you feel you are close to the point of inflection to the right of center. If you guessed that the point is at about 165, you're correct; it's between 165 and 166, and symmetrically between 134 and 135 to the left of center.

The probability histogram of a random sample average¶

What is notable about this is not just that we can spot the SD, but that the probability distribution of the statistic is bell-shaped in the first place. There is a remarkable piece of mathematics called the Central Limit Theorem that shows that the distribution of the average of a large random sample will be roughly normal, no matter what the distribution of the population from which the sample is drawn.

In our example about serial numbers, each of the 300 serial numbers is equally likely to be drawn each time, so the distribution of the population is uniform and the probability histogram looks like a brick.

serialno.hist(bins=np.arange(0.5, 300.5, 1))
Let's test out the Central Limit Theorem by drawing samples from quite a different distribution. Here is the histogram of over 37,000 flight delay times in the table ua. It looks nothing like the uniform distribution above.

plotrange = np.arange(-15, 121, 2)
ua.hist(bins=plotrange, unit='minute')
We will now look at the probability distribution of the average of 625 flight delay times drawn at random from this population. The draws will be made with replacement, as they were in the example about serial numbers.

sample_size = 625
sample_means = Table(['Samplemean'])
for i in np.arange(repetitions):
    sample = ua.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])
sample_means.hist(bins=np.arange(9, 19.1, 0.5), unit='minute')
# Draw a red bell-shaped curve from the mean and SD of the sample means
means = sample_means['Samplemean']
m = np.mean(means)
s = np.std(means)
x = np.arange(9, 19.1, 0.1)
_ = plots.plot(x, stats.norm.pdf(x, np.mean(ua[0]), np.std(ua[0])/25), color='r', lw=1)
The probabilities for the sample average are roughly bell-shaped, consistent with the Central Limit Theorem.

The average delay time of all the flights in the population is about 14 minutes, which is where the histogram is centered. To spot the SD, it will help to note that the bars are half a minute wide. By eye, the points of inflection look to be about three bars away from the center. So we would guess that the SD is about 1.5 minutes, and in fact that is not far from the exact value.

What is the exact value? We will answer that question later in the term. Remarkably, it can be calculated easily based on the sample size and the SD of the population, without doing any simulations at all.
For now, here is the main point to note.

Second main reason for using the SD to measure spread

For a large random sample, the probability distribution of the sample average will be roughly bell-shaped, with a mean and SD that can be easily identified, no matter what the distribution of the population from which the sample is drawn.

The standard normal curve

The bell-shaped curves above look essentially all the same apart from the labels that we have placed on the horizontal axes. Indeed, there is really just one basic curve from which all of these curves can be drawn just by relabeling the axes appropriately.

To draw that basic curve, we will use the units into which we can convert every list: standard units. The resulting curve is therefore called the standard normal curve.
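As a reminder, standard units measure how many SDs a value is above or below the mean of its list. A minimal version of the standard_units function used throughout this text (a sketch consistent with how it is applied below) is:

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: the number of SDs
    each value is above (+) or below (-) the mean."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)

# In standard units, any list has mean 0 and SD 1.
heights_su = standard_units([60, 62, 65, 70, 73])
print(np.mean(heights_su), np.std(heights_su))
```

Whatever the original units, the converted list always has mean 0 and SD 1, which is why a single curve can stand in for all the bell-shaped histograms.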
The standard normal curve has an impressive equation. But for now, it is best to think of it as a smoothed version of a histogram of a variable that has been measured in standard units.

# The standard normal probability density function (pdf)
plot_normal_cdf()

As always when you examine a new histogram, start by looking at the horizontal axis. On the horizontal axis of the standard normal curve, the values are standard units. Here are some properties of the curve. Some are apparent by observation, and others require a considerable amount of mathematics to establish.

The total area under the curve is 1. So you can think of it as a histogram drawn to the density scale.

The curve is symmetric about 0. So if a variable has this distribution, its mean and median are both 0.

The points of inflection of the curve are at -1 and +1. If a variable has this distribution, its SD is 1. The normal curve is one of the very few distributions that has an SD so clearly identifiable on the histogram.
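These properties can be verified numerically without any statistics library, since the standard normal cdf can be written in terms of the error function in Python's math module (a sketch, not part of the text):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z,
    using the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return (1 + erf(z / sqrt(2))) / 2

# Total area is 1 (here taken over a range wide enough to be exhaustive).
total = normal_cdf(20) - normal_cdf(-20)
# Symmetry about 0: exactly half the area lies to the left of 0.
half = normal_cdf(0)
print(round(total, 6), round(half, 6))  # 1.0 0.5
```

The same helper reproduces the 84% figure that appears below: normal_cdf(1) is about 0.8413.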
Since we are thinking of the curve as a smoothed histogram, we will want to represent proportions of the total amount of data by areas under the curve. Areas under smooth curves are often found by calculus, using a method called integration. It is a fact of mathematics, however, that the standard normal curve cannot be integrated in any of the
usual ways of calculus. Therefore, areas under the curve have to be approximated. That is why almost all statistics textbooks carry tables of areas under the normal curve. It is also why all statistical systems, including the stats module of Python, include methods that provide excellent approximations to those areas.

Areas under the normal curve

The standard normal cumulative distribution function (cdf)

The fundamental function for finding areas under the normal curve is stats.norm.cdf. It takes a numerical argument and returns all the area under the curve to the left of that number. Formally, it is called the "cumulative distribution function" of the standard normal curve. That rather unwieldy mouthful is abbreviated as cdf.

Let us use this function to find the area to the left of 1 under the standard normal curve.

# Area under the standard normal curve, below 1
plot_normal_cdf(1)

stats.norm.cdf(1)

0.84134474606854293

That's about 84%. We can now use the symmetry of the curve and the fact that the total area under the curve is 1 to find other areas.
For example, the area to the right of 1 is about 100% - 84% = 16%.

# Area under the standard normal curve, above 1
plot_normal_cdf(lbound=1)

1 - stats.norm.cdf(1)

0.15865525393145707

The area between -1 and 1 can be computed in several different ways.

# Area under the standard normal curve, between -1 and 1
plot_normal_cdf(1, lbound=-1)
For example, we can calculate the area as "100% - two equal tails", which works out to roughly 100% - 2 x 16% = 68%.

Or we could note that the area between -1 and 1 is equal to all the area to the left of 1, minus all the area to the left of -1.

stats.norm.cdf(1) - stats.norm.cdf(-1)

0.68268949213708585

By a similar calculation, we see that the area between -2 and 2 is about 95%.

# Area under the standard normal curve, between -2 and 2
plot_normal_cdf(2, lbound=-2)
stats.norm.cdf(2) - stats.norm.cdf(-2)

0.95449973610364158

In other words, if a histogram is roughly bell shaped, the proportion of data in the range "average ± 2 SDs" is about 95%.

That is quite a bit more than Chebychev's lower bound of 75%. Chebychev's bound is weaker because it has to work for all distributions. If we know that a distribution is normal, we have good approximations to the proportions, not just bounds.

The table below compares what we know about all distributions and about normal distributions. Notice that Chebychev's bound is not very illuminating when z = 1.

Percent in Range    All Distributions      Normal Distribution
average ± 1 SD      at least 0%            about 68%
average ± 2 SDs     at least 75%           about 95%
average ± 3 SDs     at least 88.888...%    about 99.73%
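The entries in the table can be reproduced directly: Chebychev's bound says that at least 1 − 1/z² of any distribution lies within z SDs of the mean, while the exact normal proportion is the area under the standard normal curve between −z and z. A sketch using only the math module:

```python
from math import erf, sqrt

def chebychev_bound(z):
    """Chebychev: at least 1 - 1/z^2 of ANY distribution lies
    within z SDs of the mean."""
    return 1 - 1 / z**2

def normal_proportion(z):
    """Exact proportion of a normal distribution within z SDs of
    the mean: Phi(z) - Phi(-z), which equals erf(z / sqrt(2))."""
    return erf(z / sqrt(2))

for z in (1, 2, 3):
    print(z, round(chebychev_bound(z), 3), round(normal_proportion(z), 4))
```

For z = 1 the bound is 0, which is why that row of the table says "at least 0%"; the normal proportion, by contrast, is already about 68%.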
Exploration: Privacy
Section author: David Wagner
License plates

We're going to look at some data collected by the Oakland Police Department. They have automated license plate readers on their police cars, and they've built up a database of license plates that they've seen -- and where and when they saw each one.

Data collection

First, we'll gather the data. It turns out the data is publicly available on the Oakland public records site. I downloaded it and combined it into a single CSV file myself before lecture.

lprs = Table.read_table('./all-lprs.csv.gz', compression='gzip', sep=',')

lprs.labels

('red_VRM', 'red_Timestamp', 'Location')

Let's start by renaming some columns, and then take a look at it.

lprs.relabel('red_VRM', 'Plate')
lprs.relabel('red_Timestamp', 'Timestamp')
lprs
Plate    Timestamp               Location
1275226  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
27529C   01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1158423  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1273718  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1077682  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1214195  01/19/2011 02:06:00 AM  (37.798281000000003, -122.27575299999999)
1062420  01/19/2011 02:06:00 AM  (37.79833, -122.27574300000001)
1319726  01/19/2011 02:05:00 AM  (37.798475000000003, -122.27571500000001)
1214196  01/19/2011 02:05:00 AM  (37.798499999999997, -122.27571)
75227    01/19/2011 02:05:00 AM  (37.798596000000003, -122.27569)
... (2742091 rows omitted)

Phew, that's a lot of data: we can see about 2.7 million license plate reads here.

Let's start by seeing what can be learned about someone, using this data -- assuming you know their license plate.

Stalking Jean Quan

As a warmup, we'll take a look at ex-Mayor Jean Quan's car, and where it has been seen. Her license plate number is 6FCH845. (How did I learn that? Turns out she was in the news for getting $1000 of parking tickets, and the news article included a picture of her car, with the license plate visible. You'd be amazed by what's out there on the Internet...)

lprs.where('Plate', '6FCH845')
Plate    Timestamp               Location
6FCH845  11/01/2012 09:04:00 AM  (37.79871, -122.276221)
6FCH845  10/24/2012 11:15:00 AM  (37.799695, -122.274868)
6FCH845  10/24/2012 11:01:00 AM  (37.799693, -122.274806)
6FCH845  10/24/2012 10:20:00 AM  (37.799735, -122.274893)
6FCH845  05/08/2014 07:30:00 PM  (37.797558, -122.26935)
6FCH845  12/31/2013 10:09:00 AM  (37.807556, -122.278485)

OK, so her car shows up 6 times in this data set. However, it's hard to make sense of those coordinates. I don't know about you, but I can't read GPS so well.

So, let's work out a way to show where her car has been seen on a map. We'll need to extract the latitude and longitude, as the data isn't quite in the format that the mapping software expects: the mapping software expects the latitude to be in one column and the longitude in another. Let's write some Python code to do that, by splitting the Location string into two pieces: the stuff before the comma (the latitude) and the stuff after (the longitude).

def get_latitude(s):
    before, after = s.split(',')          # Break it into two parts
    lat_string = before.replace('(', '')  # Get rid of the annoying '('
    return float(lat_string)              # Convert the string to a number

def get_longitude(s):
    before, after = s.split(',')          # Break it into two parts
    long_string = after.replace(')', '').strip()  # Get rid of the ')' and spaces
    return float(long_string)             # Convert the string to a number

Let's test it to make sure it works correctly.

get_latitude('(37.797558, -122.26935)')

37.797558

get_longitude('(37.797558, -122.26935)')
-122.26935

Good, now we're ready to add these as extra columns to the table.

lprs = lprs.with_columns([
    'Latitude', lprs.apply(get_latitude, 'Location'),
    'Longitude', lprs.apply(get_longitude, 'Location')
])
lprs

Plate    Timestamp               Location                                   Latitude  Longitude
1275226  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
27529C   01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1158423  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1273718  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1077682  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1214195  01/19/2011 02:06:00 AM  (37.798281000000003, -122.27575299999999)  37.7983   -122.276
1062420  01/19/2011 02:06:00 AM  (37.79833, -122.27574300000001)            37.7983   -122.276
1319726  01/19/2011 02:05:00 AM  (37.798475000000003, -122.27571500000001)  37.7985   -122.276
1214196  01/19/2011 02:05:00 AM  (37.798499999999997, -122.27571)           37.7985   -122.276
75227    01/19/2011 02:05:00 AM  (37.798596000000003, -122.27569)           37.7986   -122.276
... (2742091 rows omitted)

And at last, we can draw a map with a marker everywhere that her car has been seen.
jean_quan = lprs.where('Plate', '6FCH845').select(['Latitude', 'Longitude'])
Marker.map_table(jean_quan)
OK, so it's been seen near the Oakland police department. This should make you suspect we might be getting a bit of a biased sample. Why might the Oakland PD be the most common place where her car is seen? Can you come up with a plausible explanation for this?

Poking around

Let's try another. And let's see if we can make the map a little more fancy. It'd be nice to distinguish between license plate reads that are seen during the daytime (on a weekday), vs the evening (on a weekday), vs on a weekend. So we'll color-code the markers. To do this, we'll write some Python code to analyze the Timestamp and choose an appropriate color.
import datetime

def get_color(ts):
    t = datetime.datetime.strptime(ts, '%m/%d/%Y %I:%M:%S %p')
    if t.weekday() >= 6:
        return 'green'  # Weekend
    elif t.hour >= 6 and t.hour <= 17:
        return 'blue'   # Weekday daytime
    else:
        return 'red'    # Weekday evening

lprs.append_column('Color', lprs.apply(get_color, 'Timestamp'))

Now we can check out another license plate, this time with our spiffy color-coding. This one happens to be the car that the city issues to the Fire Chief.
t = lprs.where('Plate', '1328354').select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
Marker.map_table(t)
Hmm. We can see a blue cluster in downtown Oakland, where the Fire Chief's car was seen on weekdays during business hours. I bet we've found her office. In fact, if you happen to know downtown Oakland, those are mostly clustered right near City Hall. Also, her car was seen twice in northern Oakland on weekday evenings. One can only speculate what that indicates. Maybe dinner with a friend? Or running errands? Off to the scene of a fire? Who knows. And then the car has been seen once more, late at night on a weekend, in a residential area in the hills. Her home address, maybe?

Let's look at another. This time, we'll make a function to display the map.
def map_plate(plate):
    sightings = lprs.where('Plate', plate)
    t = sightings.select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
    return Marker.map_table(t)

map_plate('5AJG153')
What can we tell from this? Looks to me like this person lives on International Blvd and 9th, roughly. On weekdays they've been seen in a variety of locations in west Oakland. It's fun to imagine what this might indicate -- delivery person? taxi driver? someone running errands all over the place in west Oakland?

We can look at another:

map_plate('6UZA652')
What can we learn from this map? First, it's pretty easy to guess where this person lives: 16th and International, or pretty near there. And then we can see them spending some nights and a weekend near Laney College. Did they have an apartment there briefly? A relationship with someone who lived there?

Is anyone else getting a little bit creeped out about this? I think I've had enough of looking at individual people's data.

Inference

As we can see, this kind of data can potentially reveal a fair bit about people. Someone with access to the data can draw inferences. Take a moment to think about what someone might be able to infer from this kind of data.

As we've seen here, it's not too hard to make a pretty good guess at roughly where someone lives, from this kind of information: their car is probably parked near their home most nights. Also, it will often be possible to guess where someone works: if they commute into work by car, then on weekdays during business hours, their car is probably parked near their office, so we'll see a clear cluster that indicates where they work.
But it doesn't stop there. If we have enough data, it might also be possible to get a sense of what they like to do during their downtime (do they spend time at the park?). And in some cases the data might reveal that someone is in a relationship and spending nights at someone else's house. That's arguably pretty sensitive stuff.

This gets at one of the challenges with privacy. Data that's collected for one purpose (fighting crime, or something like that) can potentially reveal a lot more. It can allow the owner of the data to draw inferences -- sometimes about things that people would prefer to keep private. And that means that, in a world of "big data", if we're not careful, privacy can be collateral damage.

Mitigation

If we want to protect people's privacy, what can be done about this? That's a lengthy subject. But at risk of over-simplifying, there are a few simple strategies that data owners can take:

1. Minimize the data they have. Collect only what they need, and delete it after it's not needed.
2. Control who has access to the sensitive data. Perhaps only a handful of trusted insiders need access; if so, then one can lock down the data so only they have access to it. One can also log all accesses, to deter misuse.
3. Anonymize the data, so it can't be linked back to the individual who it is about. Unfortunately, this is often harder than it sounds.
4. Engage with stakeholders. Provide transparency, to try to avoid people being taken by surprise. Give individuals a way to see what data has been collected about them. Give people a way to opt out and have their data be deleted, if they wish. Engage in a discussion about values, and tell people what steps you are taking to protect them from unwanted consequences.

This only scratches the surface of the subject. My main goal in this lecture was to make you aware of privacy concerns, so that if you are ever a steward of a large data set, you can think about how to protect people's data and use it responsibly.
Chapter 4

Prediction
Correlation

The relation between two variables

In the previous sections, we developed several tools that help us describe the distribution of a single variable. Data science also helps us understand how multiple variables are related to each other. This allows us to predict the value of one variable given the values of others, and to get a sense of the amount of error in the prediction.

A good way to start exploring the relation between two variables is by visualization. A graph called a scatter diagram can be used to plot the value of one variable against values of another. Let us start by looking at some scatter diagrams and then move on to quantifying some of the features that we see.

The table hybrid contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were obtained from the online data archive of Prof. Larry Winner of the University of Florida. The columns include the model of the car, year of manufacture, the MSRP (manufacturer's suggested retail price) in 2013 dollars, the acceleration rate in km per hour per second, fuel economy in miles per gallon, and the model's class.

hybrid
vehicle          year  msrp     acceleration  mpg    class
Prius (1st Gen)  1997  24509.7  7.46          41.26  Compact
Tino             2000  35355    8.2           54.1   Compact
Prius (2nd Gen)  2000  26832.2  7.97          45.23  Compact
Insight          2000  18936.4  9.52          53     Two Seater
Civic (1st Gen)  2001  25833.4  7.04          47.04  Compact
Insight          2001  19036.7  9.52          53     Two Seater
Insight          2002  19137    9.71          53     Two Seater
Alphard          2003  38084.8  8.33          40.46  Minivan
Insight          2003  19137    9.52          53     Two Seater
Civic            2003  14071.9  8.62          41     Compact
... (143 rows omitted)

The Table method scatter can be used to plot acceleration on the horizontal axis (the first argument) and msrp on the vertical (the second argument). There are 153 points in the scatter, one for each car in the table.

hybrid.scatter('acceleration', 'msrp')

The scatter of points is sloping upwards, indicating that cars with greater acceleration tended to cost more, on average; conversely, the cars that cost more tended to have greater acceleration on average.

This is an example of positive association: above-average values of one variable tend to be associated with above-average values of the other.

The scatter diagram of MSRP (vertical axis) versus mileage (horizontal axis) shows a negative association. The scatter has a clear downward trend. Hybrid cars with higher mileage tended to cost less, on average. This seems surprising till you consider that cars that accelerate fast tend to be less fuel efficient and have lower mileage. As the previous scatter showed, those were also the cars that tended to cost more.

hybrid.scatter('mpg', 'msrp')
Along with the negative association, the scatter diagram of price versus efficiency shows a non-linear relation between the two variables. The points appear to be clustered around a curve.

If we restrict the data just to the SUV class, however, the association between price and efficiency is still negative but the relation appears to be more linear. The relation between the price and acceleration of SUVs also shows a linear trend, but with a positive slope.

suv = hybrid.where('class', 'SUV')
suv.scatter('mpg', 'msrp')

suv.scatter('acceleration', 'msrp')
You will have noticed that we can derive useful information from the general orientation and shape of a scatter diagram even without paying attention to the units in which the variables were measured.

Indeed, we could plot all the variables in standard units and the plots would look the same. This gives us a way to compare the degree of linearity in two scatter diagrams.

Here are the two scatter diagrams for SUVs, with all the variables measured in standard units.

Table().with_columns([
    'mpg (standard units)', standard_units(suv.column('mpg')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None
Table().with_columns([
    'acceleration (standard units)', standard_units(suv.column('acceleration')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None

The associations that we see in these figures are the same as those we saw before. Also, because the two scatter diagrams are drawn on exactly the same scale, we can see that the linear relation in the second diagram is a little fuzzier than in the first.

We will now define a measure that uses standard units to quantify the kinds of association that we have seen.
The correlation coefficient

The correlation coefficient measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.

The term correlation coefficient isn't easy to say, so it is usually shortened to correlation and denoted by r.

Here are some mathematical facts about r that we will observe by simulation.

The correlation coefficient r is a number between -1 and 1.

r measures the extent to which the scatter plot clusters around a straight line.

r = 1 if the scatter diagram is a perfect straight line sloping upwards, and r = -1 if the scatter diagram is a perfect straight line sloping downwards.

The function r_scatter takes a value of r as its argument and simulates a scatter plot with a correlation very close to r. Because of randomness in the simulation, the correlation is not expected to be exactly equal to r.

Call r_scatter a few times, with different values of r as the argument, and see how the football changes. Positive r corresponds to positive association: above-average values of one variable are associated with above-average values of the other, and the scatter plot slopes upwards.

When r = 1, the scatter plot is perfectly linear and slopes upward. When r = -1, the scatter plot is perfectly linear and slopes downward. When r = 0, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be uncorrelated.
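The source of r_scatter isn't shown in this chapter, but one standard way such a simulation can be built (a sketch, and an assumption about the implementation) is to mix a shared component with independent noise; the resulting correlation is close to the requested r:

```python
import numpy as np

def correlated_sample(r, n=10000, seed=7):
    """Sketch: generate (x, y) pairs whose correlation is close to r,
    by setting y = r*x + sqrt(1 - r^2)*noise, where x and noise are
    independent standard normals."""
    rng = np.random.RandomState(seed)
    x = rng.normal(size=n)
    noise = rng.normal(size=n)
    y = r * x + np.sqrt(1 - r**2) * noise
    return x, y

x, y = correlated_sample(0.8)
print(round(np.corrcoef(x, y)[0, 1], 2))  # close to 0.8
```

Because the sample is random, the observed correlation wobbles around the requested value, just as the text says.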
r_scatter(0.8)
r_scatter(0.25)
r_scatter(0)
r_scatter(-0.7)
Calculating r

The formula for r is not apparent from our observations so far, and it has a mathematical basis that is outside the scope of this class. However, the calculation is straightforward and helps us understand several of the properties of r.

Formula for r:

r is the average of the products of the two variables, when both variables are measured in standard units.

Here are the steps in the calculation. We will apply the steps to a simple table of values of x and y.

x = np.arange(1, 7, 1)
y = [2, 3, 1, 5, 2, 7]
t = Table().with_columns([
    'x', x,
    'y', y
])
t
x  y
1  2
2  3
3  1
4  5
5  2
6  7

Based on the scatter diagram, we expect that r will be positive but not equal to 1.

t.scatter(0, 1, s=30, color='red')

Step 1. Convert each variable to standard units.

t_su = t.with_columns([
    'x (standard units)', standard_units(x),
    'y (standard units)', standard_units(y)
])
t_su
x  y  x (standard units)  y (standard units)
1  2  -1.46385            -0.648886
2  3  -0.87831            -0.162221
3  1  -0.29277            -1.13555
4  5  0.29277             0.811107
5  2  0.87831             -0.648886
6  7  1.46385             1.78444

Step 2. Multiply each pair of standard units.

t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))
t_product
x  y  x (standard units)  y (standard units)  product of standard units
1  2  -1.46385            -0.648886           0.949871
2  3  -0.87831            -0.162221           0.142481
3  1  -0.29277            -1.13555            0.332455
4  5  0.29277             0.811107            0.237468
5  2  0.87831             -0.648886           -0.569923
6  7  1.46385             1.78444             2.61215

Step 3. r is the average of the products computed in Step 2.

# r is the average of the products of standard units
r = np.mean(t_product.column(4))
r

0.61741639718977093

As expected, r is positive but not equal to 1.
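As a check on the arithmetic, NumPy's built-in np.corrcoef returns the same value: the formula above (the average of the products of standard units) agrees with the off-diagonal entry of the correlation matrix.

```python
import numpy as np

x = np.arange(1, 7, 1)
y = np.array([2, 3, 1, 5, 2, 7])

# The formula above: average of the products of standard units.
su_x = (x - np.mean(x)) / np.std(x)
su_y = (y - np.mean(y)) / np.std(y)
r_formula = np.mean(su_x * su_y)

# NumPy's built-in correlation matrix; the [0, 1] entry is r.
r_builtin = np.corrcoef(x, y)[0, 1]
print(round(r_formula, 6), round(r_builtin, 6))  # both 0.617416
```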
Properties of r

The calculation shows that:
r is a pure number. It has no units. This is because r is based on standard units.

r is unaffected by changing the units on either axis. This too is because r is based on standard units.

r is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called x and which is called y. Geometrically, switching the axes reflects the scatter plot about the line y = x, but does not change the amount of clustering nor the sign of the association.

t.scatter('y', 'x', s=30, color='red')
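These invariance properties are easy to check numerically. The sketch below uses np.corrcoef (which computes the same r) to confirm that switching the axes, or applying a positive change of units to either variable, leaves r unchanged:

```python
import numpy as np

x = np.arange(1, 7, 1)
y = np.array([2, 3, 1, 5, 2, 7])

def r_of(a, b):
    """Correlation of two arrays, via NumPy's correlation matrix."""
    return np.corrcoef(a, b)[0, 1]

# Switching the axes does not change r.
assert np.isclose(r_of(x, y), r_of(y, x))

# Changing units (a positive linear transformation) does not change r:
# e.g., converting x from inches to centimeters, or rescaling y.
assert np.isclose(r_of(x, y), r_of(2.54 * x, y))
assert np.isclose(r_of(x, y), r_of(x, 10 * y + 3))
print('all invariance checks pass')
```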
We can define a function correlation to compute r, based on the formula that we used above. The arguments are a table and two labels of columns in the table. The function returns the mean of the products of those column values in standard units, which is r.

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x)) * standard_units(t.column(y)))

Let's call the function on the x and y columns of t. The function returns the same answer for the correlation between x and y as we got by direct application of the formula for r.

correlation(t, 'x', 'y')

0.61741639718977093
Calling correlation on columns of the table suv gives us the correlation between price and mileage as well as the correlation between price and acceleration.

correlation(suv, 'mpg', 'msrp')

-0.6667143635709919

correlation(suv, 'acceleration', 'msrp')

0.48699799279959155

These values confirm what we had observed:

There is a negative association between price and efficiency, whereas the association between price and acceleration is positive.

The linear relation between price and acceleration is a little weaker (correlation about 0.5) than between price and mileage (correlation about -0.67).

Care in Using Correlation

Correlation is a simple and powerful concept, but it is sometimes misused. Before using r, it is important to be aware of what correlation does and does not measure.

Correlation only measures association. Correlation does not imply causation. Though the correlation between the weight and the math ability of children in a school district may be positive, that does not mean that doing math makes children heavier or that putting on weight improves the children's math skills. Age is a confounding variable: older children are both heavier and better at math than younger children, on average.

Correlation measures linear association. Variables that have strong non-linear association might have very low correlation. Here is an example of variables that have a perfect quadratic relation y = x² but have correlation equal to 0.

nonlinear = Table().with_columns([
    'x', np.arange(-4, 4.1, 0.5),
    'y', np.arange(-4, 4.1, 0.5) ** 2
])
nonlinear.scatter('x', 'y', s=30, color='r')
correlation(nonlinear, 'x', 'y')

0.0

Outliers can have a big effect on correlation. Here is an example where a scatter plot for which r is equal to 1 is turned into a plot for which r is equal to 0, by the addition of just one outlying point.

line = Table().with_columns([
    'x', [1, 2, 3, 4],
    'y', [1, 2, 3, 4]
])
line.scatter('x', 'y', s=30, color='r')
correlation(line, 'x', 'y')

1.0

outlier = Table().with_columns([
    'x', [1, 2, 3, 4, 5],
    'y', [1, 2, 3, 4, 0]
])
outlier.scatter('x', 'y', s=30, color='r')

correlation(outlier, 'x', 'y')

0.0

Correlations based on aggregated data can be misleading. As an example, here are data on the Critical Reading and Math SAT scores in 2014. There is one point for each of the 50 states and one for Washington, D.C. The column Participation Rate contains the percent of high school seniors who took the test. The next three columns show the average score in the state on each portion of the test, and the final column is the average of the total scores on the test.

sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014
State                 Participation Rate  Critical Reading  Math  Writing  Combined
Alabama               6.7                 547               538   532      1617
Alaska                54.2                507               503   475      1485
Arizona               36.4                522               525   500      1547
Arkansas              4.2                 573               571   554      1698
California            60.3                498               510   496      1504
Colorado              14.3                582               586   567      1735
Connecticut           88.4                507               510   508      1525
Delaware              100                 456               459   444      1359
District of Columbia  100                 440               438   431      1309
Florida               72.2                491               485   472      1448
... (41 rows omitted)

The scatter diagram of Math scores versus Critical Reading scores is very tightly clustered around a straight line; the correlation is close to 0.985.

sat2014.scatter('Critical Reading', 'Math')

correlation(sat2014, 'Critical Reading', 'Math')

0.98475584110674341
It is important to note that this does not reflect the strength of the relation between the Math and Critical Reading scores of students. States don't take tests – students do. The data in the table have been created by lumping all the students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point, as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around each point in the figure above. The overall picture will be more fuzzy. The correlation between the Math and Critical Reading scores of the students will be lower than the value calculated based on state averages.

Correlations based on aggregates and averages are called ecological correlations and are frequently reported. As we have just seen, they must be interpreted with care.

Serious or tongue-in-cheek?

In 2012, a paper in the respected New England Journal of Medicine examined the relation between chocolate consumption and Nobel Prizes in a group of countries. The Scientific American responded seriously; others were more relaxed. The paper included the following graph:
[Graph from the paper: countries' chocolate consumption plotted against Nobel Prizes]
Regression

The regression line

The concepts of correlation and the "best" straight line through a scatter plot were developed in the late 1800s and early 1900s. The pioneers in the field were Sir Francis Galton, who was a cousin of Charles Darwin, and Galton's protégé Karl Pearson. Galton was interested in eugenics, and was a meticulous observer of the physical traits of parents and their offspring. Pearson, who had greater expertise than Galton in mathematics, helped turn those observations into the foundations of mathematical statistics.

The scatter plot below is of a famous data set collected by Pearson and his colleagues in the early 1900s. It consists of the heights, in inches, of 1,078 pairs of fathers and sons. The scatter method of a table takes an optional argument s to set the size of the dots.

heights = Table.read_table('heights.csv')
heights.scatter('father', 'son', s=10)

Notice the familiar football shape with a dense center and a few points on the perimeter. This shape results from two bell-shaped distributions that are correlated. The heights of both fathers and sons have a bell-shaped distribution.
heights.hist(bins=np.arange(55.5, 80, 1), unit='inch')

The average height of the sons is about an inch taller than the average height of the fathers.

np.mean(heights.column('son')) - np.mean(heights.column('father'))

0.99740259740261195

The difference in height between a son and a father varies as well, with the bulk of the distribution between -5 inches and +7 inches.

diffs = Table().with_column('son - father difference',
                            heights.column('son') - heights.column('father'))
diffs.hist(bins=np.arange(-9.5, 10, 1), unit='inch')
The correlation between the heights of the fathers and sons is about 0.5.

r = correlation(heights, 'father', 'son')
r

0.50116268080759108

The regression effect

For fathers around 72 inches in height, we might expect their sons to be tall as well. The histogram below shows the heights of all sons of 72-inch fathers.

six_foot_fathers = heights.where(np.round(heights.column('father')) == 72)
six_foot_fathers.hist('son', bins=np.arange(55.5, 80, 1))
Most (68%) of the sons of these 72-inch fathers are less than 72 inches tall, even though sons are an inch taller than fathers on average!

np.count_nonzero(six_foot_fathers.column('son') < 72) / six_foot_fathers.num_rows

0.6842105263157895

In fact, the average height of a son of a 72-inch father is less than 71 inches. The sons of tall fathers are simply not as tall in this sample.

np.mean(six_foot_fathers.column('son'))

70.728070175438603

This fact was noticed by Galton, who had been hoping that exceptionally tall fathers would have sons who were just as exceptionally tall. However, the data were clear, and Galton realized that the tall fathers have sons who are not quite as exceptionally tall, on average. Frustrated, Galton called this phenomenon "regression to mediocrity."

Galton also noticed that exceptionally short fathers had sons who were somewhat taller relative to their generation, on average. In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the regression effect.
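The regression effect is not special to heights; it appears whenever two variables are imperfectly correlated. In the simulation sketch below (synthetic data, not Galton's), among individuals about 2 SDs above average on the first variable, the average of the second variable is only about r × 2 = 1 SD above average:

```python
import numpy as np

# Sketch: simulate pairs with correlation r = 0.5, in standard units,
# like fathers (x) and sons (y).
np.random.seed(3)
r = 0.5
n = 100000
x = np.random.normal(size=n)
y = r * x + np.sqrt(1 - r**2) * np.random.normal(size=n)

# Take the group that is roughly 2 SDs above average on x...
group = y[(x > 1.9) & (x < 2.1)]
# ...their average y is near r * 2 = 1 SD above average, not 2 SDs:
# the regression effect.
print(round(float(np.mean(group)), 1))
```

The group is exceptional on x, but only about half as exceptional on y, exactly as the text describes.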
The figure below shows the scatter plot of the data when both variables are measured in standard units. The red line shows equal standard units and has a slope of 1, but it does not match the angle of the cloud of points. The blue line, which does follow the angle of the cloud, is called the regression line, named for the "regression to mediocrity" it predicts.

heights_standard = Table().with_columns([
    'father (standard units)', standard_units(heights['father']),
    'son (standard units)', standard_units(heights['son'])
])
heights_standard.scatter(0, fit_line=True, s=10)
_ = plots.plot([-3, 3], [-3, 3], color='r', lw=3)

The scatter method of a Table draws this regression line for us when called with fit_line=True. This line passes through the result of multiplying father heights (in standard units) by r.

father_standard = heights_standard.column(0)
heights_standard.with_column('father (standard units) * r', father_standard * r)
Another interpretation of this line is that it passes through average values for slices of the population of sons. To see this relationship, we can round the father heights each to the nearest unit, then average the heights of all sons associated with these rounded values. The green line is the regression line for these data, and it passes close to all of the yellow points, which are mean heights of sons (in standard units).

rounded = heights_standard.with_column('father (standard units)', np.round(heights_standard.column(0)))
rounded.join('father (standard units)', rounded.group(0, np.average)).scatter(0)
_ = plots.plot([-3, 3], [-3 * r, 3 * r], color='g', lw=2)
Karl Pearson used the observation of the regression effect in the data above, as well as in other data provided by Galton, to develop the formal calculation of the correlation coefficient. That is why r is sometimes called Pearson's correlation.

The regression line in original units

As we saw in the last section for football shaped scatter plots, when the variables x and y are measured in standard units, the best straight line for estimating y based on x has slope r and passes through the origin. Thus the equation of the regression line can be written as:

estimate of y, in standard units = r × (x, in standard units)

That is,

(estimate of y − average of y) / (SD of y) = r × (x − average of x) / (SD of x)

The equation can be converted into the original units of the data, either by rearranging this equation algebraically, or by labeling some important features of the line both in standard units and in the original units.
Calculation of the slope and intercept

The regression line is also commonly expressed as a slope and an intercept, where an estimate for y is computed from an x value using the equation

estimate of y = slope × x + intercept

The calculations of the slope and intercept of the regression line can be derived from the equation above:

slope of the regression line = r × (SD of y) / (SD of x)

intercept of the regression line = average of y − slope × average of x
def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

[slope(heights, 'father', 'son'), intercept(heights, 'father', 'son')]
[0.51400591254559247,33.892800540661682]
It is worth noting that the intercept of approximately 33.89 inches is not intended as an estimate of the height of a son whose father is 0 inches tall. There is no such son and no such father. The intercept is merely a geometric or algebraic quantity that helps define the line. In general, it is not a good idea to extrapolate, that is, to make estimates outside the range of the available data. It is certainly not a good idea to extrapolate as far away as 0 is from the heights of the fathers in the study.

It is also worth noting that the slope is not r! Instead, it is r multiplied by the ratio of standard deviations.
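To see this relationship concretely, here is a small check on synthetic data (hypothetical arrays, not the heights table): the quantity r multiplied by the ratio of standard deviations matches the least-squares slope computed independently by np.polyfit.

```python
import numpy as np

# Synthetic father/son-style data (hypothetical, for illustration only).
np.random.seed(0)
x = np.random.normal(68, 3, size=500)
y = 0.5 * x + np.random.normal(35, 2, size=500)

# Slope from the correlation coefficient and the two standard deviations.
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * np.std(y) / np.std(x)

# Slope from an independent least-squares fit.
least_squares_slope = np.polyfit(x, y, 1)[0]

print(slope_from_r, least_squares_slope)  # the two values agree
```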
correlation(heights, 'father', 'son')

0.50116268080759108
Fitted values

We can also estimate the height of every son in the data using the slope and intercept. The estimated values of y are called the fitted values. They all lie on a straight line. To calculate them, take a father's height, multiply it by the slope of the regression line, and add the intercept. In other words, calculate the height of the regression line at the given value of x.
def fit(table, x, y):
    """Return the height of the regression line at each x value."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

fitted = heights.with_column('son (fitted)', fit(heights, 'father', 'son'))
fitted
father  son  son (fitted)
65 59.8 67.3032
63.3 63.2 66.4294
65 63.3 67.3032
65.8 62.8 67.7144
61.1 64.3 65.2986
63 64.2 66.2752
65.4 64.1 67.5088
64.7 64 67.149
66.1 64.6 67.8686
67 64 68.3312
... (1068 rows omitted)
In original units, we can see that these fitted values fall on a line.

fitted.scatter(0, s=10)
Moreover, this line (in green below) passes near the average heights of sons for slices of the population. In this case, we round the heights of fathers to the nearest inch to construct the slices.

rounded = heights.with_column('father', np.round(heights.column('father')))
rounded.join('father', rounded.group(0, np.average)).scatter(0, s=80, fit_line=True)
_ = plots.plot(fitted.column('father'), fitted.column('son (fitted)'), color='g', lw=2)
Prediction

In the last section, we developed the concepts of correlation and regression as ways to describe data. We will now see how these concepts can become powerful tools for prediction, when used appropriately. In particular, given paired observations of two quantities and some additional observations of only the first quantity, we can infer typical values for the second quantity.

Assumptions of randomness: a "regression model"

Prediction is possible if we believe that a scatter plot reflects the underlying relation between the two variables being plotted, but does not specify the relation completely. For example, a scatter plot of the heights of fathers and sons shows us the precise relation between the two variables in one particular sample of men; but we might wonder whether that relation holds true, or almost true, among all fathers and sons in the population from which the sample was drawn, or indeed among fathers and sons in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly, because there is additional variability in the data beyond the relation. What we see are points that are scattered around the line. For each of the points, the linear signal that relates the two variables has been combined with some other source of variability that has nothing to do with the relation at all. This random noise obfuscates the linear signal, and it is our inferential goal to identify the signal despite the noise.

In order to make these statements more precise, we will use the np.random.normal function, which generates samples from a normal distribution with mean 0 and standard deviation 1. These samples correspond to a bell-shaped distribution expressed in standard units.
np.random.normal()
-0.6423537870160524
samples = Table('x')
for i in np.arange(10000):
    samples.append([np.random.normal()])
samples.hist(0, bins=np.arange(-4, 4, 0.1))
The regression model specifies that the points in a scatter plot, measured in standard units, are generated at random as follows.

The x values are sampled from the standard normal distribution.
For each x, its corresponding y is sampled using the signal_and_noise function below, which takes x and the correlation coefficient r.

def signal_and_noise(x, r):
    return r * x + np.random.normal() * (1 - r**2)**0.5

For example, if we choose r to be one half, then the following scatter diagram might result.
def regression_model(r, sample_size):
    pairs = Table(['x', 'y'])
    for i in np.arange(sample_size):
        x = np.random.normal()
        y = signal_and_noise(x, r)
        pairs.append([x, y])
    return pairs

regression_model(0.5, 1000).scatter('x', 'y')
Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below draws both the true line used to generate the data (green) and the regression line (blue) found by analyzing the pairs that were generated.
def compare(true_r, sample_size):
    pairs = regression_model(true_r, sample_size)
    estimated_r = correlation(pairs, 'x', 'y')
    pairs.scatter('x', 'y', fit_line=True)
    plt.plot([-3, 3], [-3*true_r, 3*true_r], color='g', lw=2)
    print("The true r is", true_r, "and the estimated r is", estimated_r)

compare(0.5, 1000)

The true r is 0.5 and the estimated r is 0.459672161067
With 1000 samples, we see that the true line and the regression line are quite similar. With fewer samples, the regression estimate of the true line that generated the data may be quite different.

compare(0.5, 50)

The true r is 0.5 and the estimated r is 0.627329170668
With very few samples, the regression line may have a clear positive or negative slope, even when the true relationship has a correlation of 0.

compare(0, 10)

The true r is 0 and the estimated r is 0.501582566452
With real data, we will never observe the true line. In fact, regression is often applied when we are uncertain whether the true relation between two quantities is linear at all. What the simulation shows is that if the regression model looks plausible, and we have a large sample, then the regression line is a good approximation to the true line.

Predictions

Here is an example where the regression model can be used to make predictions.

The data are a subset of the information gathered in a randomized controlled trial about treatments for Hodgkin's disease. Hodgkin's disease is a cancer that typically affects young people. The disease is curable but the treatment is very harsh. The purpose of the trial was to come up with a dosage that would cure the cancer but minimize the adverse effects on the patients.

The table hodgkins contains data on the effect that the treatment had on the lungs of 22 patients. The columns are:

Height in cm
A measure of radiation to the mantle (neck, chest, underarms)
A measure of chemotherapy
A score of the health of the lungs at baseline, that is, at the start of the treatment; higher scores correspond to more healthy lungs
The same score of the health of the lungs, 15 months after treatment
hodgkins = Table.read_table('hodgkins.csv')
hodgkins.show()
height rad chemo base month15
164 679 180 160.57 87.77
168 311 180 98.24 67.62
173 388 239 129.04 133.33
157 370 168 85.41 81.28
160 468 151 67.94 79.26
170 341 96 150.51 80.97
163 453 134 129.88 69.24
175 529 264 87.45 56.48
185 392 240 149.84 106.99
178 479 216 92.24 73.43
179 376 160 117.43 101.61
181 539 196 129.75 90.78
173 217 204 97.59 76.38
166 456 192 81.29 67.66
170 252 150 98.29 55.51
165 622 162 118.98 90.92
173 305 213 103.17 79.74
174 566 198 94.97 93.08
173 322 119 85 41.96
173 270 160 115.02 81.12
183 259 241 125.02 97.18
188 238 252 137.43 113.2
It is evident that the patients' lungs were less healthy 15 months after the treatment than at baseline. At 36 months, they did recover most of their lung function, but those data are not part of this table.

The scatter plot below shows that taller patients had higher scores at 15 months.
hodgkins.scatter('height', 'month15', fit_line=True)

The scatter plot looks roughly linear. Let us assume that the regression model holds. Under that assumption, the regression line can be used to predict the 15-month score of a new patient, based on the patient's height. The prediction will be good provided the assumptions of the regression model are justified for these data, and provided the new patient is similar to those in the study.

The slope and intercept functions give us the slope and intercept of the regression line. To predict the 15-month score of a new patient whose height is h cm, we use the following calculation:

predicted 15-month score of a patient who is h cm tall = slope × h + intercept

The predicted 15-month score of a patient who is 173 cm tall is 83.66, and that of a patient who is 163 cm tall is 73.69.
# slope and intercept
a = slope(hodgkins, 'height', 'month15')
b = intercept(hodgkins, 'height', 'month15')
# New patient, 173 cm tall
# Predicted 15-month score:
a * 173 + b

83.65699801460444

# New patient, 163 cm tall
# Predicted 15-month score:
a * 163 + b

73.694360467072741
The logic of regression prediction can be wrapped in a function predict that takes a table, the two columns of interest, and some new value for x. It predicts the average value for y.

def predict(t, x, y, new_x_value):
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a * new_x_value + b

predict(hodgkins, 'height', 'month15', 163)

73.694360467072741
Variability of the Prediction

As data scientists working under the regression model, we must always remain aware that a sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction.
To understand how much a prediction might vary, we can select different samples from the data we have. Below, we select only 15 of the 22 rows of the hodgkins table at random without replacement, then make a prediction.

predict(hodgkins.sample(15), 'height', 'month15', 163)

71.653856125189805
The prediction is different, but how much might we expect it to differ? From a single sample, it is not possible to answer this question. The simulation below draws 1000 samples of 15 rows each, then draws a histogram of the predicted values using each sample.

def sample_predict(new_x_value):
    predictions = Table(['Predictions'])
    for i in np.arange(1000):
        predicted = predict(hodgkins.sample(15), 'height', 'month15', new_x_value)
        predictions.append([predicted])
    predictions.hist(0, bins=np.arange(40, 96, 2))

sample_predict(163)
Using only 15 rows instead of all 22, the prediction could have come out differently from 73.69. This histogram shows how different the prediction might be. Almost all predictions fall between 64 and 82, but a few are outside of this range.
For an x value that is closer to the mean of all heights, the predictions from 15 rows are more tightly clustered.

sample_predict(173)

In this histogram of the predictions for a height of 173, we see that almost all predictions fall between 76 and 90. The narrow range 80 to 86 accounts for nearly 70% of the estimates.

Predictions of values outside the typical observations of height vary wildly. A data scientist is rarely justified in extrapolating a linear trend beyond the range of values observed.

sample_predict(150)
Higher-Order Functions

The functions we have defined so far compute their outputs directly from their inputs: their return values can be computed from their arguments alone. However, some functions require additional data to determine their return values.

Let's start with a simple example. Consider the following three functions that all scale their input by a constant factor.
def double(x):
    return 2 * x

def triple(x):
    return 3 * x

def halve(x):
    return 0.5 * x

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]

[8, 12, 2.0, 10, 15, 2.5]
Rather than defining each of these functions with a separate def statement, it is possible to generate each of them dynamically by passing in a constant to a generic scale_by function.

def scale_by(a):
    def scale(x):
        return a * x
    return scale

double = scale_by(2)
triple = scale_by(3)
halve = scale_by(0.5)

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]
[8, 12, 2.0, 10, 15, 2.5]
The body of the scale_by function actually defines a new function, called scale, and then returns it. The return value of scale_by is a function that takes a number x and multiplies it by a. Because scale_by is a function that returns a function, it is called a higher-order function.

The purpose of higher-order functions is to create functions that combine data with behavior. Above, the double function is created by combining the data value 2 with the behavior of scaling by a constant.

A def statement within another def statement behaves in the same way as any other def statement except that:

1. The name of the inner function is only accessible within the outer function's body. For example, it is not possible to use the name scale once scale_by has returned.
2. The body of the inner function can refer to the arguments and names within the outer function. So, for example, the body of the inner scale function can refer to a, which is an argument of the outer function scale_by.

The final line of the outer scale_by function is return scale. This line returns the function that was just defined. The purpose of returning a function is to give it a name. The line

double = scale_by(2)

gives the name double to the particular instance of the scale function that has a assigned to the number 2. Notice that it is possible to keep multiple instances of the scale function around at once, all with different names such as double and triple, and Python keeps track of which value for a to use each time one of these functions is called.
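The two numbered rules above can be checked directly. The following sketch reuses the scale_by function and shows that each returned instance remembers its own value of a, while the inner name scale is not visible outside.

```python
# The scale_by function from above.
def scale_by(a):
    def scale(x):
        return a * x      # rule 2: the body of scale can refer to a
    return scale

double = scale_by(2)
triple = scale_by(3)

# Each instance keeps its own value of a.
print(double(10), triple(10))   # 20 30

# Rule 1: the name scale is only accessible inside scale_by.
try:
    scale
except NameError:
    print("scale is not defined outside scale_by")
```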
Example: Standard Units

We have seen how to convert a list of numbers to standard units by subtracting the mean and dividing by the standard deviation.

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)
This function can convert any list of numbers to standard units.

observations = [3, 4, 2, 4, 3, 5, 1, 6]
np.round(standard_units(observations), 3)

array([-0.333, 0.333, -1., 0.333, -0.333, 1., -1.667, 1.667])

How would we answer the question, "what is the number 8 (in original units) converted to standard units?"

Adding 8 to the list of observations and re-computing standard_units does not answer this question, because adding an observation changes the scale!

observations_with_eight = [3, 4, 2, 4, 3, 5, 1, 6, 8]
standard_units(observations_with_eight)

array([-0.5, 0., -1., 0., -0.5, 0.5, -1.5, 1., 2.])
If instead we want to maintain the original scale, we need a function that remembers the original mean and standard deviation. This scenario, where both data and behavior are required to arrive at the answer, is a natural fit for a higher-order function.

def converter_to_standard_units(any_numbers):
    """Return a function that converts to the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(a_number_in_original_units):
        return (a_number_in_original_units - mean) / sd
    return convert

Now, we can use the original list of numbers to create the conversion function. The mean and standard deviation are stored within the function that is returned, called to_su below. What's more, these numbers are only computed once, no matter how many times we call the function.
observations = [3, 4, 2, 4, 3, 5, 1, 6]
to_su = converter_to_standard_units(observations)
[to_su(2), to_su(3), to_su(8)]

[-1.0, -0.33333333333333331, 3.0]
A complementary function that returns a converter from standard units to original units involves multiplying by the standard deviation, then adding the mean.

def converter_from_standard_units(any_numbers):
    """Return a function that converts from the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(in_standard_units):
        return in_standard_units * sd + mean
    return convert

from_su = converter_from_standard_units(observations)
from_su(3)

8.0
It should be the case that any number converted to standard units and back is unchanged, provided that the same list is used to create both functions.

from_su(to_su(11))

11.0
Example: Prediction

Previously, in order to compute the fitted values for a regression, we first computed the slope and intercept of the regression line. As an alternate implementation, for each x, we can convert x to standard units from the original units of x values, then multiply by r, then convert the result to the original units of y values. We can write down this process as a higher-order function.
def regression_estimator(table, x, y):
    """Return an estimator for y as a function of x."""
    to_su = converter_to_standard_units(table.column(x))
    from_su = converter_from_standard_units(table.column(y))
    r = correlation(table, x, y)
    def estimate(any_x):
        return from_su(r * to_su(any_x))
    return estimate

The following dataset of mothers and babies was collected from the Kaiser Hospital in Oakland, California.

baby = Table.read_table('baby.csv')
baby
Birth Weight  Gestational Days  Maternal Age  Maternal Height  Maternal Pregnancy Weight  Maternal Smoker
120 284 27 62 100 False
113 282 33 64 135 False
128 279 28 64 115 True
108 282 23 67 125 True
136 286 25 62 93 False
138 244 33 62 178 False
132 245 23 65 140 False
120 289 25 62 125 False
143 299 30 66 136 True
140 351 27 68 120 False
... (1164 rows omitted)
Gestational days are the number of days that the mother is pregnant. Most pregnancies last between 250 and 310 days.
baby.hist(1, bins=np.arange(200, 330, 10))

For this collection of typical births, gestational days and birth weight appear to be linearly related.

typical = baby.where(np.logical_and(baby.column(1) >= 250, baby.column(1) <= 310))
typical.scatter(1, 0, fit_line=True)
The function average_weight returns the average weight of babies born after a certain number of gestational days, using the regression line to predict this value. This function is defined within regression_estimator and returned. The returned value is bound to the name average_weight using an assignment statement.

average_weight = regression_estimator(typical, 1, 0)

The average_weight function contains all the information needed to make birth weight predictions, but when it is called, it takes only a single argument: the gestational days. We avoid the need to pass the table and column labels into the average_weight function each time that we call it because those values were passed to regression_estimator, which defined the average_weight function in terms of those values.

average_weight(290)

126.33785752202166

This function is quite flexible. For instance, we can compute fitted values for all gestational days by applying it to the column of gestational days.

fitted = typical.apply(average_weight, 1)
typical.select([1, 0]).with_column('Birth Weight (fitted)', fitted)
Higher-order functions appear often in the context of prediction because estimating a value often involves both data to inform the prediction and a procedure that makes the prediction itself.
Errors

Error in the regression estimate

Though the average residual is 0, each individual residual is not. Some residuals might be quite far from 0. To get a sense of the amount of error in the regression estimate, we will start with a graphical description of the sense in which the regression line is the "best".

Our example is a dataset that has one point for every chapter of the novel "Little Women." The goal is to estimate the number of characters (that is, letters, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course.
little_women = Table.read_table('little_women.csv')
little_women.show(3)
Characters Periods
21759 189
22148 188
20558 231
... (44 rows omitted)
# One point for each chapter
# Horizontal axis: number of periods
# Vertical axis: number of characters (as in a, b, ", ?, etc.; not people in the book)
little_women.scatter('Periods', 'Characters')
correlation(little_women, 'Periods', 'Characters')

0.92295768958548163

The scatter plot is remarkably close to linear, and the correlation is more than 0.92.

The figure below shows the scatter plot and regression line, with four of the errors marked in red.
# Residuals: Deviations from the regression line
lw_slope = slope(little_women, 'Periods', 'Characters')
lw_intercept = intercept(little_women, 'Periods', 'Characters')
print('Slope:', np.round(lw_slope))
print('Intercept:', np.round(lw_intercept))
lw_errors(lw_slope, lw_intercept)

Slope: 87.0
Intercept: 4745.0
Had we used a different line to create our estimates, the errors would have been different. The picture below shows how big the errors would be if we were to use a particularly silly line for estimation.

# Errors: Deviations from a different line
lw_errors(-100, 50000)

Below is a line that we have used before without saying that we were using a line to create estimates. It is the horizontal line at the value "average of y." Suppose you were asked to estimate y and were not told the value of x; then you would use the average of y as your estimate, regardless of the chapter. In other words, you would use the flat line below.

Each error that you would make would then be a deviation from the average. The rough size of these deviations is the SD of y.

In summary, if we use the flat line at the average of y to make our estimates, the estimates will be off by the SD of y.
# Errors: Deviations from the flat line at the average of y
characters_average = np.mean(little_women.column('Characters'))
lw_errors(0, characters_average)
Least Squares

If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average, when we were learning how to calculate the SD.

The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean squared error, which is in the same units as y.
Here is a remarkable fact of mathematics: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."

Computing the "best" line.

To get estimates of y based on x, you can use any line you want. Every line has a mean squared error of estimation. "Better" lines have smaller errors. The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.
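One way to make this fact vivid is a brute-force check on synthetic data (hypothetical arrays, not the Little Women table): compute the analytic regression line, then confirm that every line on a grid of candidate slopes and intercepts has a mean squared error at least as large.

```python
import numpy as np

# Synthetic data shaped loosely like the chapters example (hypothetical).
np.random.seed(1)
x = np.random.normal(200, 40, size=100)
y = 87 * x + 4745 + np.random.normal(0, 2500, size=100)

def mse(slope, intercept):
    return np.average((y - (slope * x + intercept)) ** 2)

# Analytic regression line: slope = r * sd(y)/sd(x), intercept from the means.
r = np.corrcoef(x, y)[0, 1]
best_slope = r * np.std(y) / np.std(x)
best_intercept = np.mean(y) - best_slope * np.mean(x)

# No line on the grid beats the regression line.
candidates = [(s, i) for s in np.arange(70, 105, 1)
                     for i in np.arange(0, 9000, 250)]
print(all(mse(s, i) >= mse(best_slope, best_intercept) for s, i in candidates))
# True
```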
We can define a function to compute the root mean squared error of any line through the Little Women scatter diagram.

def lw_rmse(slope, intercept):
    lw_errors(slope, intercept)
    x = little_women.column('Periods')
    y = little_women.column('Characters')
    fitted = slope * x + intercept
    mse = np.average((y - fitted) ** 2)
    print("Root mean squared error:", mse ** 0.5)

lw_rmse(0, characters_average)

Root mean squared error: 7019.17593405
The error of the regression line is indeed much smaller if we choose a slope and intercept near those of the regression line.

lw_rmse(90, 4000)

Root mean squared error: 2715.53910638

But the minimum is achieved using the regression line itself.

lw_rmse(lw_slope, lw_intercept)
Root mean squared error: 2701.69078531
Numerical Optimization

We can also define mean_squared_error for an arbitrary dataset. We'll use a higher-order function so that we can try many different lines on the same dataset, simply by passing in their slope and intercept.

def mean_squared_error(table, x, y):
    def for_line(slope, intercept):
        fitted = slope * table.column(x) + intercept
        return np.average((table.column(y) - fitted) ** 2)
    return for_line

The Old Faithful geyser in Yellowstone National Park erupts regularly, but the waiting time between eruptions (in seconds) and the duration of the eruptions (in seconds) do vary.

faithful = Table.read_table('faithful.csv')
faithful.scatter(1, 0)
It appears that there are two types of eruptions, short and long. The eruptions above 3 seconds appear correlated with waiting time.

long = faithful.where(faithful.column('eruptions') > 3)
long.scatter(1, 0, fit_line=True)

The mse_long function defined below takes a slope and an intercept and returns the mean squared error of a linear predictor of eruption size from the waiting time.
mse_long = mean_squared_error(long, 1, 0)
mse_long(0, 4) ** 0.5

0.50268461001194098

If we experiment with different values, we can find a low-error slope and intercept through trial and error.

mse_long(0.01, 3.5) ** 0.5

0.39143872353883735

mse_long(0.02, 2.7) ** 0.5

0.38168519564089542
The minimize function can be used to find the arguments of a function for which the function returns its minimum value. It uses a similar trial-and-error approach, following the changes that lead to incrementally lower output values.

a, b = minimize(mse_long)

The root mean squared error of the minimal slope and intercept is smaller than any of the values we considered before.

mse_long(a, b) ** 0.5

0.38014612988504093

In fact, these values a and b are the same as the values returned by the slope and intercept functions we developed based on the correlation coefficient. We see small deviations due to the inexact nature of minimize, but the values are essentially the same.
print("slope:", slope(long, 1, 0))
print("a:", a)
print("intercept:", intercept(long, 1, 0))
print("b:", b)

slope: 0.0255508940715
a: 0.0255496
intercept: 2.24752334164
b: 2.247629
Therefore, we have found not only that the regression line minimizes mean squared error, but also that minimizing mean squared error gives us the regression line. The regression line is the only line that minimizes mean squared error.

Residuals

The amount of error in each of these regression estimates is the difference between the observed value of y and its fitted value. These errors are called residuals. Some residuals are positive. These correspond to points that are above the regression line, points for which the regression line under-estimates y. Negative residuals correspond to the line over-estimating values of y.
waiting = long.column('waiting')
eruptions = long.column('eruptions')
fitted = a * waiting + b
res = long.with_columns([
    'eruptions (fitted)', fitted,
    'residuals', eruptions - fitted
])
res
ComputationalandInferentialThinking
241Errors
eruptions  waiting  eruptions (fitted)  residuals
3.6 79 4.26605 -0.666047
3.333 74 4.1383 -0.805299
4.533 85 4.41934 0.113655
4.7 88 4.49599 0.204006
3.6 85 4.41934 -0.819345
4.35 85 4.41934 -0.069345
3.917 84 4.3938 -0.476795
4.2 78 4.2405 -0.0404978
4.7 83 4.36825 0.331754
4.8 84 4.3938 0.406205
... (165 rows omitted)
As with deviations from average, the positive and negative residuals exactly cancel each other out, so the average (and the sum) of the residuals is 0. Because we found a and b by numerically minimizing the mean squared error rather than computing them exactly, the sum of the residuals is slightly different from zero in this case.

sum(res.column('residuals'))

-0.00037579999996140145
A residual plot is a scatter plot of the residuals versus the fitted values. The residual plot of a good regression looks like the one below: a formless cloud with no pattern, centered around the horizontal axis. It shows that there is no discernible non-linear pattern in the original scatter plot.

res.scatter(2, 3, fit_line=True)
By contrast, suppose we had attempted to fit a regression line to the entire set of eruptions of Old Faithful in the original dataset.

faithful.scatter(0, 1, fit_line=True)

This line does not pass through the average value of vertical strips. We may be able to see this fact from the scatter alone, but it is dramatically clear by visualizing the residual scatter.
def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table.column(y) - fitted
    res_table = Table().with_columns([
        'fitted', fitted,
        'residuals', residuals])
    res_table.scatter(0, 1, fit_line=True)

residual_plot(faithful, 1, 0)
The diagonal stripes show a non-linear pattern in the data. In this case, we would not expect the regression line to provide low-error estimates.

Example: SAT Scores

Residual plots can be useful for spotting non-linearity in the data, or other features that weaken the regression analysis. For example, consider the SAT data of the previous section, and suppose you try to estimate the Combined score based on Participation Rate.

sat = Table.read_table('sat2014.csv')
sat
State  Participation Rate  Critical Reading  Math  Writing  Combined
North Dakota  2.3  612  620  584  1816
Illinois 4.6 599 616 587 1802
Iowa 3.1 605 611 578 1794
South Dakota  2.9  604  609  579  1792
Minnesota 5.9 598 610 578 1786
Michigan 3.8 593 610 581 1784
Wisconsin 3.9 596 608 578 1782
Missouri 4.2 595 597 579 1771
Wyoming 3.3 590 599 573 1762
Kansas 5.3 591 596 566 1753
... (41 rows omitted)
sat.scatter('Participation Rate', 'Combined', s=8)

The relation between the variables is clearly non-linear, but you might be tempted to fit a straight line anyway, especially if you never looked at a scatter diagram of the data.
ComputationalandInferentialThinking
245Errors
sat.scatter('Participation Rate', 'Combined', s=8, fit_line=True)

The points in the scatter plot start out above the regression line, then are consistently below the line, then above, then below. This pattern of non-linearity is more clearly visible in the residual plot.

residual_plot(sat, 'Participation Rate', 'Combined')
_ = plt.title('Residual plot of a bad regression')
This residual plot is not a formless cloud; it shows a non-linear pattern, and is a signal that linear regression should not have been used for these data.

The Size of Residuals

We will conclude by observing several relationships between quantities we have already identified. We'll illustrate the relationships using the long eruptions of Old Faithful, but they indeed hold for any collection of paired observations.

Fact 1: The root mean squared error of a line that has zero slope and an intercept at the average of y is the standard deviation of y.

mse_long(0, np.average(eruptions)) ** 0.5

0.40967604587437784

np.std(eruptions)

0.40967604587437784
By contrast, the standard deviation of the fitted values is smaller.

eruptions_fitted = fit(long, 'waiting', 'eruptions')
np.std(eruptions_fitted)

0.15271994814407569

Fact 2: The ratio of the standard deviation of the fitted values to the standard deviation of the y values is |r|, the absolute value of the correlation.

r = correlation(long, 'waiting', 'eruptions')
print('r:', r)
print('ratio:', np.std(eruptions_fitted) / np.std(eruptions))

r: 0.372782225571
ratio: 0.372782225571
Notice the absolute value of r in the statement above. For the heights of fathers and sons, the correlation is positive and so there is no difference between using r and using its absolute value. However, the result is true for variables that have negative correlation as well, provided we are careful to use the absolute value of r instead of r.
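Here is a sketch with negatively correlated synthetic data (hypothetical arrays, not the geyser table) confirming that the ratio in Fact 2 is |r| even when r is negative.

```python
import numpy as np

# Negatively correlated synthetic data (hypothetical).
np.random.seed(2)
x = np.random.normal(0, 1, size=1000)
y = -0.6 * x + np.random.normal(0, 1, size=1000)

r = np.corrcoef(x, y)[0, 1]               # negative
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = slope * x + intercept

print(r < 0)                               # True
print(np.std(fitted) / np.std(y), abs(r))  # equal: the ratio is |r|, not r
```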
The regression line does a better job of estimating eruption lengths than a zero-slope line. Thus, the rough size of the errors made using the regression line must be smaller than that made using a flat line. In other words, the SD of the residuals must be smaller than the overall SD of y.
eruptions_residuals = eruptions - eruptions_fitted
np.std(eruptions_residuals)

0.38014612980028634

Fact 3: The SD of the residuals is √(1 − r²) times the SD of y.

np.std(eruptions) * (1 - r**2) ** 0.5

0.38014612980028639

Fact 4: The variance of the fitted values plus the variance of the residuals gives the variance of y.

np.std(eruptions_fitted)**2 + np.std(eruptions_residuals)**2

0.16783446256326531

np.std(eruptions)**2

0.16783446256326534
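Facts 3 and 4 can be verified on any paired data. Here is a sketch on synthetic data (hypothetical arrays, not the Old Faithful table) using the slope-and-intercept formulas from earlier in the chapter.

```python
import numpy as np

# Synthetic waiting/eruption-style data (hypothetical).
np.random.seed(3)
x = np.random.normal(70, 10, size=500)
y = 0.03 * x + np.random.normal(2, 0.4, size=500)

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = slope * x + intercept
residuals = y - fitted

# Fact 3: sd(residuals) = sqrt(1 - r**2) * sd(y)
print(np.std(residuals), (1 - r**2) ** 0.5 * np.std(y))

# Fact 4: var(fitted) + var(residuals) = var(y)
print(np.std(fitted)**2 + np.std(residuals)**2, np.std(y)**2)
```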
Multiple Regression

Now that we have explored ways to use multiple features to predict a categorical variable, it is natural to study ways of using multiple predictor variables to predict a quantitative variable. A commonly used method to do this is called multiple regression.

We will start with an example to review some fundamental aspects of simple regression, that is, regression based on one predictor variable.

Example: Simple Regression

Suppose that our goal is to use regression to estimate the height of a basset hound based on its weight, using a sample that looks consistent with the regression model. Suppose the observed correlation r is 0.5, and that the summary statistics for the two variables are as in the table below:
        average    SD
height  14 inches  2 inches
weight  50 pounds  5 pounds
To calculate the equation of the regression line, we need the slope and the intercept:

slope = r × (SD of height) / (SD of weight) = 0.5 × 2/5 = 0.2 inches per pound
intercept = average height − slope × average weight = 14 − 0.2 × 50 = 4 inches

The equation of the regression line allows us to calculate the estimated height, in inches, based on a given weight in pounds:

estimated height = 0.2 × weight + 4

The slope of the line measures the increase in the estimated height per unit increase in weight. The slope is positive, and it is important to note that this does not mean that we think basset hounds get taller if they put on weight. The slope reflects the difference in the average heights of two groups of dogs that are 1 pound apart in weight. Specifically, consider a group of dogs whose weight is w pounds, and the group whose weight is w + 1 pounds. The second group is estimated to be 0.2 inches taller, on average. This is true for all values of w in the sample.

In general, the slope of the regression line can be interpreted as the average increase in y per unit increase in x. Note that if the slope is negative, then for every unit increase in x, the average of y decreases.
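The basset hound numbers above can be worked out directly from the summary statistics; this sketch just applies the slope and intercept formulas from the previous section.

```python
# Summary statistics from the table above.
r = 0.5
avg_height, sd_height = 14, 2    # inches
avg_weight, sd_weight = 50, 5    # pounds

slope = r * sd_height / sd_weight            # 0.2 inches per pound
intercept = avg_height - slope * avg_weight  # 4 inches

# Estimated height of a hypothetical 45-pound basset hound, in inches.
print(round(slope * 45 + intercept, 2))      # 13.0
```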
MultiplePredictors¶
In multiple regression, more than one predictor variable is used to estimate $y$. For example, a Dartmouth study of undergraduates collected information about their use of an online course forum and their GPA. We might want to predict a student's GPA based on the number of days that they visit the course forum and how many times they answer someone else's question. Then the multiple regression model would involve two slopes, an intercept, and random errors as before:

$$\text{GPA} = \text{slope}_1 \cdot \text{days online} + \text{slope}_2 \cdot \text{answers} + \text{intercept} + \text{error}$$
Our goal would be to find the estimated GPA using the best estimates of the two slopes and the intercept; as before, the "best" estimates are those that minimize the mean squared error of estimation.
To start off, we will investigate the dataset, which is described online. Each row represents a student. In addition to the student's GPA, the row tallies

days online: The number of days on which the student viewed the online forum
views: The number of posts viewed
contributions: The number of contributions, including posts and follow-up discussions
questions: The number of questions posted
notes: The number of notes posted
answers: The number of answers posted
grades = Table.read_table('grades_and_piazza.csv')
grades
GPA  days online  views  contributions  questions  notes  answers
2.863 29 299 5 1 1 0
3.505 57 299 0 0 0 0
3.029 27 101 1 1 0 0
3.679 67 301 1 0 0 0
3.474 43 201 12 1 0 0
3.705 67 308 45 22 0 5
3.806 36 171 20 4 3 4
3.667 82 300 26 11 0 3
3.245 44 127 6 1 1 0
3.293 35 259 16 13 1 0
... (20 rows omitted)
Correlation Matrix¶
Perhaps we wish to predict GPA based on forum usage. A natural first step is to see which variables are correlated with the GPA. Here is the correlation matrix. The to_df method generates a Pandas dataframe containing the same data as the table, and its corr method generates a matrix of correlations for each pair of columns.

All forum usage variables are correlated with GPA, but the notes correlation has a very small magnitude.
grades.to_df().corr()
               GPA        days online  views     contributions  questions
GPA            1.000000   0.684905     0.444175  0.427897       0.409212
days online    0.684905   1.000000     0.654557  0.448319       0.435269
views          0.444175   0.654557     1.000000  0.426406       0.361002
contributions  0.427897   0.448319     0.426406  1.000000       0.857981
questions      0.409212   0.435269     0.361002  0.857981       1.000000
notes          -0.160604  -0.230839    0.065627  0.295661       -0.006365
answers        0.440382   0.502810     0.365010  0.702679       0.515661
To start off, let us perform the simple regression of GPA on just days online, the count with the strongest correlation with GPA, with an r of 0.68. Here is the scatter diagram and regression line.

grades.scatter('days online', 'GPA', fit_line=True)
We can immediately see that the relation is not entirely linear, and a residual plot highlights this fact. One issue is that the GPA is never above 4.0, so the model assumption that y values are bell shaped does not hold in this case.

residual_plot(grades, 'days online', 'GPA')
Nonetheless, the regression line does capture some of the pattern in the data, and so we will continue to work with it, noting that our predictions may not be entirely justified by the regression model.
The root mean squared error is a bit above 1/4 of a GPA point when predicting GPA from only the days online.

grades_days_mse = mean_squared_error(grades, 'days online', 'GPA')
a, b = minimize(grades_days_mse)
grades_days_mse(a, b) ** 0.5

0.28494512878662448
Two Predictors¶

Perhaps more information can improve our prediction. The number of answers is also correlated with GPA. When using it alone as a predictor, the root mean squared error is greater because the correlation has lower magnitude than days online.

grades_answers_mse = mean_squared_error(grades, 'answers', 'GPA')
a, b = minimize(grades_answers_mse)
grades_answers_mse(a, b) ** 0.5

0.35110518920562062

However, the two variables can be used together. First, we define the mean squared error in terms of a line that has two slopes, one for each variable, as well as an intercept.
def grades_days_answers_mse(a_days, a_answers, b):
    fitted = a_days * grades.column('days online') + \
             a_answers * grades.column('answers') + \
             b
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)

minimize(grades_days_answers_mse)

array([ 0.0157683,  0.0301254,  2.6031789])
The minimize function returns three values: the slope for days online, the slope for answers, and the intercept. Together, these three values describe a linear prediction function. For example, someone who spent 50 days online and answered 5 questions is predicted to have a GPA of about 3.54.

np.round(0.0157683 * 50 + 0.0301254 * 5 + 2.6031789, 2)

3.54

The root mean squared error of this predictor is a bit smaller than that of either single-variable predictor we had before!

a_days, a_answers, b = minimize(grades_days_answers_mse)
grades_days_answers_mse(a_days, a_answers, b) ** 0.5
0.28161529539241298
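Since the mean squared error is a quadratic function of the slopes and the intercept, the minimizing values can also be found in closed form by least squares. A sketch on small made-up arrays (not the Dartmouth dataset), using NumPy's `lstsq`:

```python
import numpy as np

# made-up stand-ins for two predictor columns and a response
days = np.array([10., 20., 30., 40., 50.])
answers = np.array([0., 3., 1., 4., 2.])
gpa = 0.02 * days + 0.05 * answers + 2.5  # an exact linear relation

# design matrix: one column per slope, plus a column of ones for the intercept
X = np.column_stack([days, answers, np.ones(len(days))])

# least squares minimizes the mean squared error of the linear fit
coeffs, *_ = np.linalg.lstsq(X, gpa, rcond=None)
# coeffs holds [slope for days, slope for answers, intercept]
```

Because the response here is exactly linear in the two predictors, least squares recovers the coefficients 0.02, 0.05, and 2.5.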
Combining all variables¶

We can combine all information about the course forum in order to attempt to find an even better predictor.
def fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    return a_days * grades.column('days online') + \
           a_views * grades.column('views') + \
           a_contributions * grades.column('contributions') + \
           a_questions * grades.column('questions') + \
           a_notes * grades.column('notes') + \
           a_answers * grades.column('answers') + \
           b

def grades_all_mse(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    fitted = fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b)
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)
minimize(grades_all_mse)

array([  1.43703000e-02,  -8.15000000e-05,   6.50720000e-03,
        -1.72780000e-03,  -4.04014000e-02,   1.59645000e-02,
         2.65375470e+00])
Using all of this information, we have constructed a predictor with an even smaller root mean squared error. The * below passes all of the return values of the minimize function as arguments to grades_all_mse.

grades_all_mse(*minimize(grades_all_mse)) ** 0.5
0.27783237243108722
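The star-unpacking used above is ordinary Python: `f(*args)` passes the elements of `args` as separate positional arguments. A minimal illustration with a hypothetical three-parameter prediction function (not the book's code):

```python
# a hypothetical stand-in for the linear prediction described in the text
def linear_prediction(a_days, a_answers, b):
    return a_days * 50 + a_answers * 5 + b

params = [0.0157683, 0.0301254, 2.6031789]
# linear_prediction(*params) is equivalent to
# linear_prediction(params[0], params[1], params[2])
prediction = linear_prediction(*params)
```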
All of this information did not improve our predictor very much. The root mean squared error decreased from 0.285 for the best single-variable predictor to 0.278 when using all of the available information.
Definition of $R^2$, consistent with our old $r$¶
When we studied simple regression, we had noted that

$$r = \frac{\text{SD of fitted values}}{\text{SD of observed values}}$$

Let us use our old functions to compute the fitted values and confirm that this is true for our example:
fitted = fit(grades, 'days online', 'GPA')
np.std(fitted) / np.std(grades.column('GPA'))
0.68490480299828194
Because variance is the square of the standard deviation, we can say that

$$r^2 = \frac{\text{variance of fitted values}}{\text{variance of observed values}}$$

Notice that this way of thinking about $r^2$ involves only the estimated values and the observed values, not the number of predictor variables. Therefore, it motivates the definition of multiple $R^2$:

$$R^2 = \frac{\text{variance of fitted values}}{\text{variance of observed values}}$$
It is a fact of mathematics that this quantity is always between 0 and 1. With multiple predictor variables, there is no clear interpretation of a sign attached to the square root of $R^2$. Some of the predictors might be positively associated with $y$, others negatively. An overall measure of the fit is provided by $R^2$.
fitted = fit_grades(*minimize(grades_all_mse))
np.var(fitted) / np.var(grades.column('GPA'))
0.49525855565521787
It is not always wise to maximize $R^2$ by including many variables, but we will address this concern later in the text.
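For simple regression, this definition of $R^2$ agrees exactly with $r^2$; a small numerical check on made-up data:

```python
import numpy as np

# made-up data for a simple regression of y on x
x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2., 1., 4., 3., 6., 5.])

# least-squares fitted values from the regression of y on x
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

r = np.corrcoef(x, y)[0, 1]             # correlation coefficient
r_squared = np.var(fitted) / np.var(y)  # variance-ratio definition of R^2
```

The two quantities match because the fitted values are a linear function of x, so their variance is $\text{slope}^2 \cdot \text{Var}(x) = r^2 \cdot \text{Var}(y)$.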
We can also view the residual plot of multiple regression, just as before. The horizontal axis shows fitted values and the vertical axis shows errors. Again, there is some structure to this plot, so perhaps the relation between these variables is not entirely linear. In this case, the structure is minor enough that the regression model is a reasonable choice of predictor.
Table().with_columns([
    'fitted', fitted,
    'errors', grades.column('GPA') - fitted
]).scatter(0, 1, fit_line=True)
Classification

Section author: David Wagner
This section will discuss machine learning. Machine learning is a class of techniques for automatically finding patterns in data and using it to draw inferences or make predictions. We're going to focus on a particular kind of machine learning, namely, classification.
Classification is about learning how to make predictions from past examples: we're given some examples where we have been told what the correct prediction was, and we want to learn from those examples how to make good predictions in the future. Here are a few applications where classification is used in practice:
For each order Amazon receives, Amazon would like to predict: is this order fraudulent? They have some information about each order (e.g., its total value, whether the order is being shipped to an address this customer has used before, whether the shipping address is the same as the credit card holder's billing address). They have lots of data on past orders, and they know which of those past orders were fraudulent and which weren't. They want to learn patterns that will help them predict, as new orders arrive, whether those new orders are fraudulent.
Online dating sites would like to predict: are these two people compatible? Will they hit it off? They have lots of data on which matches they've suggested to their customers in the past, and they have some idea which ones were successful. As new customers sign up, they'd like to make predictions about who might be a good match for them.
Doctors would like to know: does this patient have cancer? Based on the measurements from some lab test, they'd like to be able to predict whether the particular patient has cancer. They have lots of data on past patients, including their lab measurements and whether they ultimately developed cancer, and from that, they'd like to try to infer what measurements tend to be characteristic of cancer (or non-cancer) so they can diagnose future patients accurately.
Politicians would like to predict: are you going to vote for them? This will help them focus fundraising efforts on people who are likely to support them, and focus get-out-the-vote efforts on voters who will vote for them. Public databases and commercial databases have a lot of information about most people: e.g., whether they own a home or rent; whether they live in a rich neighborhood or poor neighborhood; their interests and hobbies; their shopping habits; and so on. And political campaigns have surveyed some voters and found out who they plan to vote for, so they have some examples where the correct answer is known. From this data, the campaigns would like to find patterns that will help them make predictions about all other potential voters.

All of these are classification tasks. Notice that in each of these examples, the prediction is a yes/no question -- we call this binary classification, because there are only two possible predictions. In a classification task, we have a bunch of observations. Each observation represents a single individual or a single situation where we'd like to make a prediction. Each observation has multiple attributes, which are known (e.g., the total value of the order; the voter's annual salary; and so on). Also, each observation has a class, which is the answer to the question we care about (e.g., yes or no; fraudulent or not; etc.).
For instance, with the Amazon example, each order corresponds to a single observation. Each observation has several attributes (e.g., the total value of the order, whether the order is being shipped to an address this customer has used before, and so on). The class of the observation is either 0 or 1, where 0 means that the order is not fraudulent and 1 means that the order is fraudulent. Given the attributes of some new order, we are trying to predict its class.

Classification requires data. It involves looking for patterns, and to find patterns, you need data. That's where the data science comes in. In particular, we're going to assume that we have access to training data: a bunch of observations, where we know the class of each observation. The collection of these pre-classified observations is also called a training set. A classification algorithm is going to analyze the training set, and then come up with a classifier: an algorithm for predicting the class of future observations.

Note that classifiers do not need to be perfect to be useful. They can be useful even if their accuracy is less than 100%. For instance, if the online dating site occasionally makes a bad recommendation, that's OK; their customers already expect to have to meet many people before they'll find someone they hit it off with. Of course, you don't want the classifier to make too many errors -- but it doesn't have to get the right answer every single time.
Chronic kidney disease¶

Let's work through an example. We're going to work with a dataset that was collected to help doctors diagnose chronic kidney disease (CKD). Each row in the dataset represents a single patient who was treated in the past and whose diagnosis is known. For each patient, we have a bunch of measurements from a blood test. We'd like to find which measurements are most useful for diagnosing CKD, and develop a way to classify future patients as "has CKD" or "doesn't have CKD" based on their blood test results.

Let's load the dataset into a table and look at it.
ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')
ckd
Age  Blood Pressure  Specific Gravity  Albumin  Sugar  Red Blood Cells  Pus Cell  Pus Cell clumps
48  70  1.005  4  0  normal    abnormal  present
53  90  1.02   2  0  abnormal  abnormal  present
63  70  1.01   3  0  abnormal  abnormal  present
68  80  1.01   3  2  normal    abnormal  present
61  80  1.015  2  0  abnormal  abnormal  not present
48  80  1.025  4  0  normal    abnormal  not present
69  70  1.01   3  4  normal    abnormal  not present
73  70  1.005  0  0  normal    normal    not present
73  80  1.02   2  0  abnormal  abnormal  not present
46  60  1.01   1  0  normal    normal    not present
... (148 rows omitted)
We have data on 158 patients. There are an awful lot of attributes here. The column labelled "Class" indicates whether the patient was diagnosed with CKD: 1 means they have CKD, 0 means they do not have CKD.

Let's look at two columns in particular: the hemoglobin level (in the patient's blood), and the blood glucose level (at a random time in the day; without fasting specially for the blood test). We'll draw a scatterplot, to make it easy to visualize this. Red dots are patients with CKD; blue dots are patients without CKD. What test results seem to indicate CKD?

ckd.scatter('Hemoglobin', 'Glucose', c=ckd.column('Class'))
Suppose Alice is a new patient who is not in the dataset. If I tell you Alice's hemoglobin level and blood glucose level, could you predict whether she has CKD? It sure looks like it! You can see a very clear pattern here: points in the lower-right tend to represent people who don't have CKD, and the rest tend to be folks with CKD. To a human, the pattern is obvious. But how can we program a computer to automatically detect patterns such as this one?

Well, there are lots of kinds of patterns one might look for, and lots of algorithms for classification. But I'm going to tell you about one that turns out to be surprisingly effective. It is called nearest neighbor classification. Here's the idea. If we have Alice's hemoglobin and glucose numbers, we can put her somewhere on this scatterplot; the hemoglobin is her x-coordinate, and the glucose is her y-coordinate. Now, to predict whether she has CKD or not, we find the nearest point in the scatterplot and check whether it is red or blue; we predict that Alice should receive the same diagnosis as that patient.

In other words, to classify Alice as CKD or not, we find the patient in the training set who is "nearest" to Alice, and then use that patient's diagnosis as our prediction for Alice. The intuition is that if two points are near each other in the scatterplot, then the corresponding measurements are pretty similar, so we might expect them to receive the same diagnosis (more likely than not). We don't know Alice's diagnosis, but we do know the diagnosis of all the patients in the training set, so we find the patient in the training set who is most similar to Alice, and use that patient's diagnosis to predict Alice's diagnosis.

The scatterplot suggests that this nearest neighbor classifier should be pretty accurate. Points in the lower-right will tend to receive a "no CKD" diagnosis, as their nearest neighbor will be a blue point. The rest of the points will tend to receive a "CKD" diagnosis, as their nearest neighbor will be a red point. So the nearest neighbor strategy seems to capture our intuition pretty well, for this example.
However, the separation between the two classes won't always be quite so clean. For instance, suppose that instead of hemoglobin levels we were to look at white blood cell count. Look at what happens:
ckd.scatter('White Blood Cell Count', 'Glucose', c=ckd.column('Class'))
As you can see, non-CKD individuals are all clustered in the lower-left. Most of the patients with CKD are above or to the right of that cluster... but not all. There are some patients with CKD who are in the lower left of the above figure (as indicated by the handful of red dots scattered among the blue cluster). What this means is that you can't tell for certain whether someone has CKD from just these two blood test measurements.

If we are given Alice's glucose level and white blood cell count, can we predict whether she has CKD? Yes, we can make a prediction, but we shouldn't expect it to be 100% accurate. Intuitively, it seems like there's a natural strategy for predicting: plot where Alice lands in the scatterplot; if she is in the lower-left, predict that she doesn't have CKD, otherwise predict she has CKD. This isn't perfect -- our predictions will sometimes be wrong. (Take a minute and think it through: for which patients will it make a mistake?) As the scatterplot above indicates, sometimes people with CKD have glucose and white blood cell levels that look identical to those of someone without CKD, so any classifier is inevitably going to make the wrong prediction for them.
Can we automate this on a computer? Well, the nearest neighbor classifier would be a reasonable choice here too. Take a minute and think it through: how will its predictions compare to those from the intuitive strategy above? When will they differ? Its predictions will be pretty similar to our intuitive strategy, but occasionally it will make a different prediction. In particular, if Alice's blood test results happen to put her right near one of the red dots in the lower-left, the intuitive strategy would predict "not CKD", whereas the nearest neighbor classifier will predict "CKD".
There is a simple generalization of the nearest neighbor classifier that fixes this anomaly. It is called the k-nearest neighbor classifier. To predict Alice's diagnosis, rather than looking at just the one neighbor closest to her, we can look at the 3 points that are closest to her, and use the diagnosis for each of those 3 points to predict Alice's diagnosis. In particular, we'll use the majority value among those 3 diagnoses as our prediction for Alice's diagnosis. Of course, there's nothing special about the number 3: we could use 4, or 5, or more. (It's often convenient to pick an odd number, so that we don't have to deal with ties.) In general, we pick a number $k$, and our predicted diagnosis for Alice is based on the $k$ patients in the training set who are closest to Alice. Intuitively, these are the $k$ patients whose blood test results were most similar to Alice, so it seems reasonable to use their diagnoses to predict Alice's diagnosis.
The $k$-nearest neighbor classifier will now behave just like our intuitive strategy above.
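The majority-vote step can be sketched on its own; `majority_vote` below is a hypothetical helper for illustration, not the chapter's code (a table-based version appears later in the text):

```python
from collections import Counter

def majority_vote(labels):
    # the most common class among the k nearest neighbors' labels
    return Counter(labels).most_common(1)[0][0]

vote = majority_vote([1, 1, 0])  # two of the three neighbors have class 1
```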
Decision boundary¶

Sometimes a helpful way to visualize a classifier is to map the region of space where the classifier would predict 'CKD', and the region of space where it would predict 'not CKD'. We end up with some boundary between the two, where points on one side of the boundary will be classified 'CKD' and points on the other side will be classified 'not CKD'. This boundary is called the decision boundary. Each different classifier will have a different decision boundary; the decision boundary is just a way to visualize what criteria the classifier is using to classify points.
Banknote authentication¶

Let's do another example. This time we'll look at predicting whether a banknote (e.g., a $20 bill) is counterfeit or legitimate. Researchers have put together a dataset for us, based on photographs of many individual banknotes: some counterfeit, some legitimate. They computed a few numbers from each image, using techniques that we won't worry about for this course. So, for each banknote, we know a few numbers that were computed from a photograph of it as well as its class (whether it is counterfeit or not). Let's load it into a table and take a look.

banknotes = Table.read_table('banknote.csv')
banknotes
WaveletVar WaveletSkew WaveletCurt Entropy Class
3.6216 8.6661 -2.8073 -0.44699 0
4.5459 8.1674 -2.4586 -1.4621 0
3.866 -2.6383 1.9242 0.10645 0
3.4566 9.5228 -4.0112 -3.5944 0
0.32924 -4.4552 4.5718 -0.9888 0
4.3684 9.6718 -3.9606 -3.1625 0
3.5912 3.0129 0.72888 0.56421 0
2.0922 -6.81 8.4636 -0.60216 0
3.2032 5.7588 -0.75345 -0.61251 0
1.5356 9.1772 -2.2718 -0.73535 0
... (1362 rows omitted)

Let's look at whether the first two numbers tell us anything about whether the banknote is counterfeit or not. Here's a scatterplot:
banknotes.scatter('WaveletVar', 'WaveletCurt', c=banknotes.column('Class'))
Pretty interesting! Those two measurements do seem helpful for predicting whether the banknote is counterfeit or not. However, in this example you can now see that there is some overlap between the blue cluster and the red cluster. This indicates that there will be some images where it's hard to tell whether the banknote is legitimate based on just these two numbers. Still, you could use a $k$-nearest neighbor classifier to predict the legitimacy of a banknote.
Take a minute and think it through: suppose we used a $k$-nearest neighbor classifier. What parts of the plot would the classifier get right, and what parts would it make errors on? What would the decision boundary look like?

The patterns that show up in the data can get pretty wild. For instance, here's what we'd get if we used a different pair of measurements from the images:
banknotes.scatter('WaveletSkew', 'Entropy', c=banknotes.column('Class'))
There does seem to be a pattern, but it's a pretty complex one. Nonetheless, the $k$-nearest neighbors classifier can still be used and will effectively "discover" patterns out of this. This illustrates how powerful machine learning can be: it can effectively take advantage of even patterns that we would not have anticipated, or that we would not have known how to "program into" the computer.
Multiple attributes¶

So far I've been assuming that we have exactly 2 attributes that we can use to help us make our prediction. What if we have more than 2? For instance, what if we have 3 attributes?
Here's the cool part: you can use the same ideas for this case, too. All you have to do is make a 3-dimensional scatterplot, instead of a 2-dimensional plot. You can still use the $k$-nearest neighbors classifier, but now computing distances in 3 dimensions instead of just 2. It just works. Very cool!

In fact, there's nothing special about 2 or 3. If you have 4 attributes, you can use the $k$-nearest neighbors classifier in 4 dimensions. 5 attributes? Work in 5-dimensional space. And no need to stop there! This all works for arbitrarily many attributes; you just work in a very high dimensional space. It gets wicked-impossible to visualize, but that's OK. The computer algorithm generalizes very nicely: all you need is the ability to compute the distance, and that's not hard. Mind-blowing stuff!

For instance, let's see what happens if we try to predict whether a banknote is counterfeit or not using 3 of the measurements, instead of just 2. Here's what you get:
ax = plt.figure(figsize=(8, 8)).add_subplot(111, projection='3d')
ax.scatter(banknotes.column('WaveletSkew'),
           banknotes.column('WaveletVar'),
           banknotes.column('WaveletCurt'),
           c=banknotes.column('Class'))

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x1093bdef0>
Awesome! With just 2 attributes, there was some overlap between the two clusters (which means that the classifier was bound to make some mistakes for points in the overlap). But when we use these 3 attributes, the two clusters have almost no overlap. In other words, a classifier that uses these 3 attributes will be more accurate than one that only uses the 2 attributes.

This is a general phenomenon in classification. Each attribute can potentially give you new information, so more attributes sometimes help you build a better classifier. Of course, the cost is that now we have to gather more information to measure the value of each attribute, but this cost may be well worth it if it significantly improves the accuracy of our classifier.
To sum up: you now know how to use $k$-nearest neighbor classification to predict the answer to a yes/no question, based on the values of some attributes, assuming you have a training set with examples where the correct prediction is known. The general roadmap is this:
1. identify some attributes that you think might help you predict the answer to the question;
2. gather a training set of examples where you know the values of the attributes as well as the correct prediction;
3. to make predictions in the future, measure the value of the attributes and then use $k$-nearest neighbor classification to predict the answer to the question.
Breast cancer diagnosis¶

Now I want to do a more extended example based on diagnosing breast cancer. I was inspired by Brittany Wenger, who won the Google national science fair three years ago as a 17-year-old high school student. Here's Brittany:

Brittany's science fair project was to build a classification algorithm to diagnose breast cancer. She won grand prize for building an algorithm whose accuracy was almost 99%.

Let's see how well we can do, with the ideas we've learned in this course.

So, let me tell you a little bit about the dataset. Basically, if a woman has a lump in her breast, the doctors may want to take a biopsy to see if it is cancerous. There are several different procedures for doing that. Brittany focused on fine needle aspiration (FNA), because it is less invasive than the alternatives. The doctor gets a sample of the mass, puts it under a microscope, takes a picture, and a trained lab tech analyzes the picture to determine whether it is cancer or not. We get a picture like one of the following:
Unfortunately, distinguishing between benign vs malignant can be tricky. So, researchers have studied using machine learning to help with this task. The idea is that we'll ask the lab tech to analyze the image and compute various attributes: things like the typical size of a cell, how much variation there is among the cell sizes, and so on. Then, we'll try to use this information to predict (classify) whether the sample is malignant or not. We have a training set of past samples from women where the correct diagnosis is known, and we'll hope that our machine learning algorithm can use those to learn how to predict the diagnosis for future samples.

We end up with the following dataset. For the "Class" column, 1 means malignant (cancer); 0 means benign (not cancer).

patients = Table.read_table('breast-cancer.csv').drop('ID')
patients
Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  Bare Nuclei  Bland Chromatin
5  1  1  1  2  1  3
5  4  4  5  7  10  3
3  1  1  1  2  2  3
6  8  8  1  3  4  3
4  1  1  3  2  1  3
8  10  10  8  7  10  9
1  1  1  1  2  10  3
2  1  2  1  2  1  3
2  1  1  1  2  1  1
4  2  1  1  2  1  2
... (673 rows omitted)
So we have 9 different attributes. I don't know how to make a 9-dimensional scatterplot of all of them, so I'm going to pick two and plot them:

patients.scatter('Bland Chromatin', 'Single Epithelial Cell Size', c=patients.column('Class'))
Oops. That plot is utterly misleading, because there are a bunch of points that have identical values for both the x- and y-coordinates. To make it easier to see all the data points, I'm going to add a little bit of random jitter to the x- and y-values. Here's how that looks:
def randomize_column(a):
    return a + np.random.normal(0.0, 0.09, size=len(a))

Table().with_columns([
    'Bland Chromatin (jittered)',
    randomize_column(patients.column('Bland Chromatin')),
    'Single Epithelial Cell Size (jittered)',
    randomize_column(patients.column('Single Epithelial Cell Size'))
]).scatter(0, 1, c=patients.column('Class'))
For instance, you can see there are lots of samples with chromatin = 2 and epithelial cell size = 2; all non-cancerous.

Keep in mind that the jittering is just for visualization purposes, to make it easier to get a feeling for the data. When we want to work with the data, we'll use the original (unjittered) data.
Applying the k-nearest neighbor classifier to breast cancer diagnosis¶

We've got a dataset. Let's try out the $k$-nearest neighbor classifier and see how it does. This is going to be great.
We're going to need an implementation of the $k$-nearest neighbor classifier. In practice you would probably use an existing library, but it's simple enough that I'm going to implement it myself.
The first thing we need is a way to compute the distance between two points. How do we do this? In 2-dimensional space, it's pretty easy. If we have a point at coordinates $(x_1, y_1)$ and another at $(x_2, y_2)$, the distance between them is

$$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

(Where did this come from? It comes from the Pythagorean theorem: we have a right triangle with side lengths $x_1 - x_2$ and $y_1 - y_2$, and we want to find the length of the diagonal.)

In 3-dimensional space, the formula is

$$D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$

In $n$-dimensional space, things are a bit harder to visualize, but I think you can see how the formula generalizes: we sum up the squares of the differences between each individual coordinate, and then take the square root of that. Let's implement a function to compute this distance function for us:
def distance(pt1, pt2):
    total = 0
    for i in np.arange(len(pt1)):
        total = total + (pt1.item(i) - pt2.item(i)) ** 2
    return math.sqrt(total)
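As a quick sanity check, the distance function returns 5 for the legs of a 3-4-5 right triangle (NumPy arrays support the same `.item` access as table rows):

```python
import math
import numpy as np

def distance(pt1, pt2):
    # Euclidean distance: sum squared coordinate differences, then square root
    total = 0
    for i in np.arange(len(pt1)):
        total = total + (pt1.item(i) - pt2.item(i)) ** 2
    return math.sqrt(total)

# a 3-4-5 right triangle in two dimensions
d = distance(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
```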
Next, we're going to write some code to implement the classifier. The input is a patient p who we want to diagnose. The classifier works by finding the $k$ nearest neighbors of p from the training set. So, our approach will go like this:

1. Find the closest $k$ neighbors of p, i.e., the $k$ patients from the training set that are most similar to p.
2. Look at the diagnoses of those $k$ neighbors, and take the majority vote to find the most-common diagnosis. Use that as our predicted diagnosis for p.

So that will guide the structure of our Python code.
To implement the first step, we will compute the distance from each patient in the training set to p, sort them by distance, and take the $k$ closest patients in the training set. The code will make a copy of the table, compute the distance from each patient to p, add a new column to the table with those distances, and then sort the table by distance and take the first $k$ rows. That leads to the following Python code:
def closest(training, p, k):
    ...

def majority(topkclasses):
    ...

def classify(training, p, k):
    kclosest = closest(training, p, k)
    topkclasses = kclosest.select('Class')
    return majority(topkclasses)
def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class')
    for i in np.arange(training.num_rows):
        dists[i] = distance(attributes.row(i), p)
    return training.with_column('Distance', dists)

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(np.arange(k))
    return topk

def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)
Let's see how this works, with our dataset. We'll take patient 12 and imagine we're going to try to diagnose them:
patients.take(12)
Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  Bare Nuclei  Bland Chromatin
5  3  3  3  2  3  4
We can pull out just their attributes (excluding the class), like this:
patients.drop('Class').row(12)
Row(Clump Thickness=5, Uniformity of Cell Size=3, Uniformity of Cell Shape=3, Marginal Adhesion=3, Single Epithelial Cell Size=2, Bare Nuclei=3, Bland Chromatin=4, Normal Nucleoli=4, Mitoses=1)
Let's take $k = 5$. We can find the 5 nearest neighbors:

closest(patients, patients.drop('Class').row(12), 5)
Clump Thickness  Uniformity of Cell Size  Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  Bare Nuclei  Bland Chromatin
5  3  3  3  2  3  4
5  3  3  4  2  4  3
5  1  3  3  2  2  2
5  2  2  2  2  2  3
5  3  3  1  3  3  3
3 out of the 5 nearest neighbors have class 1, so the majority is 1 (has cancer) -- and that is the output of our classifier for this patient:

classify(patients, patients.drop('Class').row(12), 5)
1
Awesome! We now have a classification algorithm for diagnosing whether a patient has breast cancer or not, based on the measurements from the lab. Are we done? Shall we give this to doctors to use?

Hold on: we're not done yet. There's an obvious question to answer, before we start using this in practice:

How accurate is this method, at diagnosing breast cancer?

And that raises a more fundamental issue. How can we measure the accuracy of a classification algorithm?
Measuring accuracy of a classifier¶

We've got a classifier, and we'd like to determine how accurate it will be. How can we measure that?
Try it out. One natural idea is to just try it on patients for a year, keep records on it, and see how accurate it is. However, this has some disadvantages: (a) we're trying something on patients without knowing how accurate it is, which might be unethical; (b) we have to wait a year to find out whether our classifier is any good. If it's not good enough and we get an idea for an improvement, we'll have to wait another year to find out whether our improvement was better.

Get some more data. We could try to get some more data from other patients whose diagnosis is known, and measure how accurate our classifier's predictions are on those additional patients. We can compare what the classifier outputs against what we know to be true.

Use the data we already have. Another natural idea is to re-use the data we already have: we have a training set that we used to train our classifier, so we could just run our classifier on every patient in the dataset and compare what it outputs to what we know to be true. This is sometimes known as testing the classifier on your training set.

How should we choose among these options? Are they all equally good?

It turns out that the third option, testing the classifier on our training set, is fundamentally flawed. It might sound attractive, but it gives misleading results: it will over-estimate the accuracy of the classifier (it will make us think the classifier is more accurate than it really is). Intuitively, the problem is that what we really want to know is how well the classifier has done at "generalizing" beyond the specific examples in the training set; but if we test it on patients from the training set, then we haven't learned anything about how well it would generalize to other patients.
This is subtle, so it might be helpful to try an example. Let's try a thought experiment. Let's focus on the 1-nearest neighbor classifier ($k = 1$). Suppose you trained the 1-nearest neighbor classifier on data from all 683 patients in the dataset, and then you tested it on those same 683 patients. How many would it get right? Think it through and see if you can work out what will happen. That's right! The classifier will get the right answer for all 683 patients. Suppose we apply the classifier to a patient from the training set, say Alice. The classifier will look for the nearest neighbor (the most similar patient from the training set), and the nearest neighbor will turn out to be Alice herself (the distance from any point to itself is zero). Thus, the classifier will produce the right diagnosis for Alice. The same reasoning applies to every other patient in the training set.
So,ifwetestthe1-nearestneighborclassifieronthetrainingset,theaccuracywillalwaysbe100%:absolutelyperfect.Thisistruenomatterwhetherthereareactuallyanypatternsinthedata.Butthe100%isatotallie.Whenyouapplytheclassifiertootherpatientswhowerenotinthetrainingset,theaccuracycouldbefarworse.
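The thought experiment above is easy to check with a small sketch in plain Python (the toy data and function names here are our own, not the chapter's classifier):

```python
import math

# Toy training set: (features, label) pairs with distinct feature vectors
training = [((1.0, 2.0), 0), ((3.0, 1.0), 1), ((4.0, 4.0), 1), ((0.0, 5.0), 0)]

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify_1nn(training, point):
    # Predict the label of the single closest training example
    nearest = min(training, key=lambda example: distance(example[0], point))
    return nearest[1]

# Test the classifier on its own training set: every point's nearest
# neighbor is itself (distance zero), so every prediction is "correct".
correct = sum(classify_1nn(training, x) == y for x, y in training)
print(correct / len(training))  # → 1.0
```

The 100% training accuracy appears no matter what the labels are, which is exactly why it tells us nothing about generalization.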
In other words, testing on the training set tells you nothing about how accurate the 1-nearest neighbor classifier will be. This illustrates why testing on the training set is so flawed. The flaw is blatant when you use the 1-nearest neighbor classifier, but don't think that some other classifier would be immune to this problem -- the problem is fundamental and applies no matter what classifier you use. Testing on the training set gives you a biased estimate of the classifier's accuracy. For these reasons, you should never test on the training set.
So what should you do instead? Is there a more principled approach?
It turns out there is. The approach comes down to: get more data. More specifically, the right solution is to use one dataset for training, and a different dataset for testing, with no overlap between the two datasets. We call these a training set and a test set.
Where do we get these two datasets from? Typically, we'll start out with some data, e.g., the dataset on 683 patients, and before we do anything else with it, we'll split it up into a training set and a test set. We might put 50% of the data into the training set and the other 50% into the test set. Basically, we are setting aside some data for later use, so we can use it to measure the accuracy of our classifier. Sometimes people will call the data that you set aside for testing a hold-out set, and they'll call this strategy for estimating accuracy the hold-out method.
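The split itself can be sketched in plain Python (the list of integers here is just a stand-in for the real patient records):

```python
import random

records = list(range(10))        # stand-in for 10 patient records
random.shuffle(records)          # randomly permute before splitting

half = len(records) // 2
training_set = records[:half]    # first 50% for training
test_set = records[half:]        # remaining 50% held out for testing

# The two sets cover all the data and share no records.
assert sorted(training_set + test_set) == list(range(10))
assert set(training_set).isdisjoint(test_set)
```

Shuffling first matters: if the data were sorted (say, all benign cases first), a naive front/back split would give the two halves very different distributions.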
Note that this approach requires great discipline. Before you start applying machine learning methods, you have to take some of your data and set it aside for testing. You must avoid using the test set for developing your classifier: you shouldn't use it to help train your classifier or tweak its settings or for brainstorming ways to improve your classifier. Instead, you should use it only once, at the very end, after you've finalized your classifier, when you want an unbiased estimate of its accuracy.
The effectiveness of our classifier, for breast cancer
OK, so let's apply the hold-out method to evaluate the effectiveness of the k-nearest neighbor classifier for breast cancer diagnosis. The dataset has 683 patients, so we'll randomly permute the dataset and put 342 of them in the training set and the remaining 341 in the test set.
patients = patients.sample(683)  # Randomly permute the rows
trainset = patients.take(range(342))
testset = patients.take(range(342, 683))
We'll train the classifier using the 342 patients in the training set, and evaluate how well it performs on the test set. To make our lives easier, we'll write a function to evaluate a classifier on every patient in the test set:
def evaluate_accuracy(training, test, k):
    testattrs = test.drop('Class')
    numcorrect = 0
    for i in range(test.num_rows):
        # Run the classifier on the ith patient in the test set
        c = classify(training, testattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == test.column('Class').item(i):
            numcorrect = numcorrect + 1
    return numcorrect / test.num_rows
Now for the grand reveal -- let's see how we did. We'll arbitrarily use k = 5.
evaluate_accuracy(trainset,testset,5)
0.967741935483871
About 96% accuracy. Not bad! Pretty darn good for such a simple technique.
As a footnote, you might have noticed that Brittany Wenger did even better. What techniques did she use? One key innovation is that she incorporated a confidence score into her results: her algorithm had a way to determine when it was not able to make a confident prediction, and for those patients, it didn't even try to predict their diagnosis. Her algorithm was 99% accurate on the patients where it made a prediction -- so that extension seemed to help quite a bit.
Important takeaways
Here are a few lessons we want you to learn from this.
First, machine learning is powerful. If you had to try to write code to make a diagnosis without knowing about machine learning, you might spend a lot of time on trial-and-error, trying to come up with some complicated set of rules that seem to work, and the result might not be very accurate. The k-nearest neighbors algorithm automates the entire task for you. And machine learning often lets you make predictions far more accurately than anything you'd come up with by trial-and-error.
Second, you can do it. Yes, you. You can use machine learning in your own work to make predictions based on data. You now know enough to start applying these ideas to new datasets and help others make useful predictions. The techniques are very powerful, but you don't have to have a Ph.D. in statistics to use them.
Third, be careful about how you evaluate accuracy. Use a hold-out set.
There's lots more one can say about machine learning: how to choose attributes, how to choose k or other parameters, what other classification methods are available, how to solve more complex prediction tasks, and lots more. In this course, we've barely even scratched the surface. If you enjoyed this material, you might enjoy continuing your studies in statistics and computer science; courses like Stats 132 and 154 and CS 188 and 189 go into a lot more depth.
Features
Section author: David Wagner
Building a model
So far, we have talked about prediction, where the purpose of learning is to be able to predict the class of new instances. I'm now going to switch to model building, where the goal is to learn a model of how the class depends upon the attributes.
One place where model building is useful is for science: e.g., which genes influence whether you become diabetic? This is interesting and useful in its own right (apart from any applications to predicting whether a particular individual will become diabetic), because it can potentially help us understand the workings of our body.
Another place where model building is useful is for control: e.g., what should I change about my advertisement to get more people to click on it? How should I change the profile picture I use on an online dating site, to get more people to "swipe right"? Which attributes make the biggest difference to whether people click/swipe? Our goal is to determine which attributes to change, to have the biggest possible effect on something we care about.
We already know how to build a classifier, given a training set. Let's see how to use that as a building block to help us solve these problems.
How do we figure out which attributes have the biggest influence on the output? Take a moment and see what you can come up with.
Feature selection
Background: attributes are also called features in the machine learning literature.
Our goal is to find a subset of features that are most relevant to the output. The way we'll formalize this is to identify a subset of features that, when we train a classifier using just those features, gives the highest possible accuracy at prediction.
Intuitively, if we get 90% accuracy using all the features and 88% accuracy using just three of the features (for example), then it stands to reason that those three features are probably the most relevant, and they capture most of the information that affects or determines the output.
With this insight, our problem becomes:
Find the subset of features that gives the best possible accuracy (when we use only those features for prediction).
This is a feature selection problem. There are many possible approaches to feature selection. One simple one is to try all possible ways of choosing a subset of the features, and evaluate the accuracy of each. However, this can be very slow, because there are so many ways to choose a subset of the features.
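To get a feel for why exhaustive search gets slow, we can count the possibilities. For instance, with 14 candidate features (the number in the tree-cover data later in this section), the counts are easy to compute:

```python
from math import comb

# Number of ways to choose exactly 4 features out of 14
print(comb(14, 4))   # → 1001

# Number of possible feature subsets of any size
print(2 ** 14)       # → 16384

# Greedy selection instead evaluates at most 14 + 13 + 12 + 11 = 50
# classifiers to build up a 4-feature subset.
print(14 + 13 + 12 + 11)  # → 50
```

With hundreds of candidate features, the gap between exhaustive and greedy search becomes astronomical.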
Therefore, we'll consider a more efficient procedure that often works reasonably well in practice. It is known as greedy feature selection. Here's how it works.
1. Try each feature on its own, to see how much accuracy we can get using a classifier trained with just that one feature. Keep the best feature.
2. Now we have one feature. Try each of the remaining features, to see which is the best one to add to it (i.e., we are now training a classifier with just 2 features: the best feature picked in step 1, plus one more). Keep the one that best improves accuracy. Now we have 2 features.
3. Repeat. At each stage, we try all possibilities for how to add one more feature to the feature subset we've already picked, and we keep the one that best improves accuracy.
Let's implement it and try it on some examples!
Code for k-NN
First, some code from last time, to implement k-nearest neighbors.
def distance(pt1, pt2):
    tot = 0
    for i in range(len(pt1)):
        tot = tot + (pt1[i] - pt2[i])**2
    return math.sqrt(tot)
def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class').rows
    for i in range(training.num_rows):
        dists[i] = distance(attributes[i], p)
    withdists = training.copy()
    withdists.append_column('Distance', dists)
    return withdists

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(range(k))
    return topk
def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)
def evaluate_accuracy(training, valid, k):
    validattrs = valid.drop('Class')
    numcorrect = 0
    for i in range(valid.num_rows):
        # Run the classifier on the ith patient in the validation set
        c = classify(training, validattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == valid['Class'][i]:
            numcorrect = numcorrect + 1
    return numcorrect / valid.num_rows
Code for feature selection
Now we'll implement the feature selection algorithm. First, a subroutine to evaluate the accuracy when using a particular subset of features:
def evaluate_features(training, valid, features, k):
    tr = training.select(['Class'] + features)
    va = valid.select(['Class'] + features)
    return evaluate_accuracy(tr, va, k)
Next, we'll implement a subroutine that, given a current subset of features, tries all possible ways to add one more feature to the subset, and evaluates the accuracy of each candidate. It returns a table that summarizes the accuracy of each option it examined.
def try_one_more_feature(training, valid, baseattrs, k):
    results = Table(['Attribute', 'Accuracy'])
    for attr in training.drop(['Class'] + baseattrs).labels:
        acc = evaluate_features(training, valid, [attr] + baseattrs, k)
        results.append((attr, acc))
    return results.sort('Accuracy', descending=True)
Finally, we'll implement the greedy feature selection algorithm, using the above subroutines. To help us understand what's going on, it will print out, at each iteration, all the features it considered and the accuracy it got with each.
def select_features(training, valid, k, maxfeatures=3):
    results = Table(['NumAttrs', 'Attributes', 'Accuracy'])
    curattrs = []
    iters = min(maxfeatures, len(training.labels) - 1)
    while len(curattrs) < iters:
        print('== Computing best feature to add to ' + str(curattrs))
        # Try all ways of adding just one more feature to curattrs
        r = try_one_more_feature(training, valid, curattrs, k)
        r.show()
        print()
        # Take the single best feature and add it to curattrs
        attr = r['Attribute'][0]
        acc = r['Accuracy'][0]
        curattrs.append(attr)
        results.append((len(curattrs), ','.join(curattrs), acc))
    return results
Example: Tree Cover
Now let's try it out on an example. I'm working with a dataset gathered by the US Forestry service. They visited thousands of wilderness locations and recorded various characteristics of the soil and land. They also recorded what kind of tree was growing predominantly on that land. Focusing only on areas where the tree cover was either Spruce or Lodgepole Pine, let's see if we can figure out which characteristics have the greatest effect on whether the predominant tree cover is Spruce or Lodgepole Pine.
There are 500,000 records in this dataset -- more than I can analyze with the software we're using. So, I'll pick a random sample of just a fraction of these records, to let us do some experiments that will complete in a reasonable amount of time.
all_trees = Table.read_table('treecover2.csv.gz', sep=',')
training = all_trees.take(range(0, 1000))
validation = all_trees.take(range(1000, 1500))
test = all_trees.take(range(1500, 2000))
training.show(2)
Elevation  Aspect  Slope  HorizDistToWater  VertDistToWater  HorizDistToRoad
2804       139     9      268               65               3180
2785       155     18     242               118              3090
... (998 rows omitted)
Let's start by figuring out how accurate a classifier will be if trained using this data. I'm going to arbitrarily use k = 15 for the k-nearest neighbor classifier.
evaluate_accuracy(training,validation,15)
0.572
Now we'll apply feature selection. I wonder which characteristics have the biggest influence on whether Spruce vs Lodgepole Pine grows? We'll look for the best three features.
best_features = select_features(training, validation, 15)
== Computing best feature to add to []
Attribute Accuracy
Elevation 0.762
HorizDistToRoad 0.562
HorizDistToFire 0.534
Hillshade9am 0.518
Hillshade3pm 0.516
Slope 0.516
Area4 0.514
Area3 0.514
Area2 0.514
Area1 0.514
VertDistToWater 0.512
HillshadeNoon 0.51
Aspect 0.51
HorizDistToWater 0.502
== Computing best feature to add to ['Elevation']
Attribute Accuracy
HorizDistToWater 0.782
Hillshade9am 0.776
Slope 0.77
Hillshade3pm 0.768
Area4 0.762
Area3 0.762
Area2 0.762
Area1 0.762
VertDistToWater 0.762
Aspect 0.75
HillshadeNoon 0.748
HorizDistToRoad 0.73
HorizDistToFire 0.664
== Computing best feature to add to ['Elevation', 'HorizDistToWater']
Attribute Accuracy
Hillshade3pm 0.792
HillshadeNoon 0.792
Slope 0.784
Aspect 0.784
Area4 0.782
Area3 0.782
Area2 0.782
Area1 0.782
Hillshade9am 0.78
VertDistToWater 0.776
HorizDistToRoad 0.708
HorizDistToFire 0.64
best_features
NumAttrs Attributes Accuracy
1 Elevation 0.762
2 Elevation,HorizDistToWater 0.782
3 Elevation,HorizDistToWater,Hillshade3pm 0.792
As we can see, Elevation looks like far and away the most discriminative feature. This suggests that this characteristic might play a large role in the biology of which tree grows best, and thus might tell us something about the science.
What about the horizontal distance to water and the amount of shade at 3pm? It looks like they might also be predictive, and thus might play a role in the biology as well -- assuming the apparent improvement in accuracy is real, and we're not just fooling ourselves.
Hold-out sets: Training, Validation, Testing
Suppose we built a predictor using just the best two features, Elevation and HorizDistToWater. How accurate would we expect it to be on the entire population of locations? 78.2% accurate? More? Less? Why?
Well, at this point, it's hard to tell. It's the same issue we mentioned earlier about fooling ourselves. We've tried multiple different approaches and taken the best; if we then evaluate it on the same dataset we used to select which is best, we will get a biased number -- it might not be an accurate estimate of the true accuracy.
Why? Our dataset is noisy. We've looked for correlations, and kept the association that had the highest correlation in our training set. But was that a real relationship, or was it just noise? If we pick a single attribute and measure its accuracy on a sample, we'd expect this to be a reasonable approximation to its accuracy on the entire population, but with some random error. If we look at 100 possible combinations and choose the best, it's possible we found one whose accuracy on the entire population is indeed large -- or it's possible we were just selecting the one whose error term happened to be the largest of the 100.
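This "selecting the luckiest of many noisy estimates" effect is easy to simulate in plain Python (the setup below is our own toy model, not the tree data): suppose 100 candidate features are all truly useless, so each one's true accuracy is exactly 50%, and we measure each on a 500-record validation sample.

```python
import random

random.seed(0)

def measured_accuracy(true_accuracy, n):
    # Fraction correct out of n predictions, each independently
    # correct with probability true_accuracy.
    correct = sum(random.random() < true_accuracy for _ in range(n))
    return correct / n

# 100 useless features, each measured on a 500-record sample
estimates = [measured_accuracy(0.5, 500) for _ in range(100)]

# Any single estimate is roughly unbiased: the average is close to 0.5 ...
print(sum(estimates) / len(estimates))

# ... but the best-looking feature appears noticeably better than 50%,
# even though every feature is truly useless.
print(max(estimates))
```

Reporting the maximum as "our accuracy" overstates it; that is the bias the hold-out test set protects against.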
For these reasons, we can't expect the accuracy numbers we've computed above to necessarily be a good, unbiased measure of the accuracy on the entire population.
The way to get an unbiased estimate of accuracy is the same as last lecture: get some more data, or set some aside in the beginning so we have more when we need it. In this case, I set aside two extra chunks of data, a validation dataset and a test dataset. I used the validation set to select the few best features. Now we're going to measure the performance of this on the test set, just to see what happens.
evaluate_features(training, validation, ['Elevation'], 15)
0.762
evaluate_features(training, test, ['Elevation'], 15)
0.712
evaluate_features(training, validation, ['Elevation', 'HorizDistToWater'], 15)
0.782
evaluate_features(training, test, ['Elevation', 'HorizDistToWater'], 15)
0.742
Why do you think we see this difference?
To get a better feeling for this, we can look at each of our features and compare its accuracy on the validation set vs its accuracy on another sample, to see how much random error there is in these accuracy figures. If there were no error, then both accuracy numbers would always be the same, but as we'll see, in practice there is some error, because computing the accuracy on a sample is only an approximation to the accuracy on the entire population.
accuracies = Table(['ValidationAccuracy', 'Accuracyonanothersample'])
another_sample = all_trees.take(range(2000, 2500))
for feature in training.drop('Class').labels:
    x = evaluate_features(training, validation, [feature], 15)
    y = evaluate_features(training, another_sample, [feature], 15)
    accuracies.append((x, y))
accuracies.sort('ValidationAccuracy')
ValidationAccuracy Accuracyonanothersample
0.502 0.346
0.51 0.384
0.51 0.35
0.512 0.358
0.514 0.342
0.514 0.342
0.514 0.342
0.514 0.342
0.516 0.366
0.516 0.352
...(4rowsomitted)
accuracies.scatter('ValidationAccuracy')
Thought Questions
Suppose that the top two attributes had been Elevation and HorizDistToRoad. Interpret this for me. What might this mean for the biology of trees? One possible explanation is that the distance to the nearest road affects what kind of tree grows; can you give any other possible explanations?
Once we know the top two attributes are Elevation and HorizDistToWater, suppose we next wanted to know how they affect what kind of tree grows: e.g., does high elevation tend to favor spruce, or does it favor lodgepole pine? How would you go about answering these kinds of questions?
The scientists also gathered some more data that I left out, for simplicity: for each location, they also gathered what kind of soil it has, out of 40 different types. The original dataset had a column for soil type, with numbers from 1-40 indicating which of the 40 types of soil was present. Suppose I wanted to include this among the other characteristics. What would go wrong, and how could I fix it up?
For this example we picked k arbitrarily. Suppose we wanted to pick the best value of k -- the one that gives the best accuracy. How could we go about doing that? What are the pitfalls, and how could they be addressed?
Suppose I wanted to use feature selection to help me adjust my online dating profile picture to get the most responses. There are some characteristics I can't change (such as how handsome I am), and some I can (such as whether I smile or not). How would I adjust the feature selection algorithm above to account for this?
Chapter 5
Inference
Confidence Intervals
Whenever we analyze a random sample of a population, our goal is to draw robust conclusions about the whole population, despite the fact that we can measure only part of it. We must guard against being misled by the inevitable random variations in a sample. It is possible for a sample to have very different characteristics than the population itself, but such samples are unlikely. It is much more typical that a random sample, especially a large simple random sample, is representative of the population. More importantly, the chance that a sample resembles the population is something we know in advance, based on how we selected our random sample. Statistical inference is the practice of using this information to make precise statements about a population based on a random sample.
Statistics and Parameters
The Bay Area Bike Share service published a dataset describing every bicycle rental from September 2014 to August 2015 in their system. This set of all trips is a population, and fortunately we have the data for each and every trip rather than just a sample. We'll focus only on the free trips, which are trips that last less than 1800 seconds (half an hour).
free_trips = Table.read_table("trip.csv").where('Duration', are.below(1800))
free_trips
Start Station                                  End Station                                    Duration
Harry Bridges Plaza (Ferry Building)           San Francisco Caltrain (Townsend at 4th)       765
San Antonio Shopping Center                    Mountain View City Hall                        1036
Post at Kearny                                 2nd at South Park                              307
San Jose City Hall                             San Salvador at 1st                            409
Embarcadero at Folsom                          Embarcadero at Sansome                         789
Yerba Buena Center of the Arts (3rd @ Howard)  San Francisco Caltrain (Townsend at 4th)       293
Embarcadero at Folsom                          Embarcadero at Sansome                         896
Embarcadero at Sansome                         Steuart at Market                              255
Beale at Market                                Temporary Transbay Terminal (Howard at Beale)  126
Post at Kearny                                 South Van Ness at Market                       932
... (338333 rows omitted)
A quantity measured for a whole population is called a parameter. For example, the average duration of a free trip between September 2014 and August 2015 can be computed exactly from these data: 550.00.
np.average(free_trips.column('Duration'))
550.00143345658103
Although we have information for the full population, we will investigate what we can learn from a simple random sample. Below, we sample 100 trips without replacement from free_trips, not just once but 40 different times. All 4000 trips are stored in a table called samples.
n = 100
samples = Table(['Sample #', 'Duration'])
for i in np.arange(40):
    for trip in free_trips.sample(n).rows:
        samples.append([i, trip.item('Duration')])
samples
Sample #  Duration
0 557
0 491
0 877
0 279
0 339
0 542
0 715
0 255
0 1543
0 194
...(3990rowsomitted)
A quantity measured for a sample is called a statistic. For example, the average duration of a free trip in a sample is a statistic, and we find that there is some variation across samples.
for i in np.arange(3):
    sample_average = np.average(samples.where('Sample #', i).column('Duration'))
    print("Average duration in sample", i, "is", sample_average)
Average duration in sample 0 is 547.29
Average duration in sample 1 is 500.6
Average duration in sample 2 is 660.16
Often, the goal of analysis is to estimate parameters from statistics. Proper statistical inference involves more than just picking an estimate: we should also say something about how confident we should be in that estimate. A single guess is rarely sufficient to make a robust conclusion about a population. We also need to quantify in some way how precise we believe that guess to be.
Already we can see from the example above that the average duration of a sample is around 550, the parameter. Our goal is to quantify this notion of around, and one way to do so is to state a range of values.
Error in Estimates
We can learn a great deal about the behavior of a statistic by computing it for many samples. In the scatter diagram below, the order in which the statistic was computed appears on the horizontal axis, and the value of the average duration for a sample of n trips appears on the vertical axis.
samples.group(0,np.average).scatter(0,1,s=50)
The sample number doesn't tell us anything beyond when the sample was selected, and we can see from this unstructured cloud of points that one sample does not affect the next in any consistent way. The sample averages are indeed all around 550. Some are above, and some are below.
samples.group(0,np.average).hist(1,bins=np.arange(480,641,10))
There are 40 averages for 40 samples. If we find an interval that contains the middle 38 out of 40, it will contain 95% of the sample averages. One such interval is bounded by the second smallest and the second largest sample averages among the 40 samples.
averages = np.sort(samples.group(0, np.average).column(1))
lower = averages.item(1)
upper = averages.item(38)
print('Empirically, 95% of sample averages fell within the interval', lower, 'to', upper)
Empirically, 95% of sample averages fell within the interval 502.38 to 620.04
This statement gives us our first notion of a margin of error. If this statement were to hold up not just for our 40 samples, but for all samples, then a reasonable strategy for estimating the true population average would be to draw a 100-trip sample, compute its sample average, and guess that the true average falls within the corresponding maximum deviation of that quantity. We would be right at least 95% of the time.
The problem with this approach is that finding the maximum deviation required drawing many different samples, and we would like to be able to say something similar after having collected only one 100-trip sample.
Variability Within Samples
By contrast, the individual durations in the samples themselves are not closely clustered around 550. Instead, they span the whole range from 0 to 1800. The durations for the first three 100-item samples are visualized below. All have somewhat similar shapes; they were drawn from the same population.
for i in np.arange(3):
    samples.where('Sample #', i).hist(1, bins=np.arange(0, 1801, 200))
We have now observed two different ways in which the variation of a random sample can be observed: the statistics computed from those samples vary in magnitude, and the sampled quantities themselves vary in distribution. That is, there is variability across samples, which we see by comparing sample averages, and there is variability within each sample, which we see in the duration histograms above.
Percentiles
The 80th percentile of a collection of values is the smallest value in the collection that is at least as large as 80% of the values in the collection. Likewise, the 40th percentile is at least as large as 40% of the values.
For example, let's consider the sizes of the five largest continents: Africa, Antarctica, Asia, North America, and South America, rounded to the nearest million square miles.
sizes = [12, 17, 6, 9, 7]
The 20th percentile is the smallest value that is at least as large as 20%, or one fifth, of the elements.
percentile(20,sizes)
6
The 60th percentile is at least as large as 60%, or three fifths, of the elements.
percentile(60,sizes)
9
To find the element corresponding to a percentile p, sort the elements and take the element in position (p/100)·n, where n is the size of the collection. In general, any percentile from 0 to 100 can be computed for a collection of values, and it is always an element of the collection. If (p/100)·n is not an integer, round it up to find the position of the percentile element. For example, the 90th percentile of a five-element collection is the fifth element, because (90/100)·5 is 4.5, which rounds up to 5.
percentile(90,sizes)
17
When the percentile function is called with a single argument, it returns a function that computes percentiles of collections.
tenth=percentile(10)
tenth(sizes)
6
tenth(np.arange(50))
4
tenth(np.arange(5000))
499
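The rule above can be sketched as a small plain-Python function (our own stand-in for this section's `percentile` function, without the curried single-argument form):

```python
import math

def percentile_of(p, values):
    # Sort, then take the element in position ceil(p/100 * n) (1-indexed).
    ordered = sorted(values)
    n = len(ordered)
    index = math.ceil(p / 100 * n)
    return ordered[index - 1]

sizes = [12, 17, 6, 9, 7]
print(percentile_of(20, sizes))        # → 6
print(percentile_of(60, sizes))        # → 9
print(percentile_of(90, sizes))        # → 17
print(percentile_of(10, range(5000)))  # → 499
```

Because the result is always an element of the collection, no averaging or interpolation between values is involved.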
Confidence Intervals
Inference begins with a population: a collection that you can describe but often cannot measure directly. One thing we can often do with a population is to draw a simple random sample. By measuring a statistic of the sample, we may hope to estimate a parameter of the population. In this process, the population is not random, nor are its parameters. The sample is random, and so is its statistic. The whole process of estimation therefore gives a random result. It starts from a population (fixed), draws a sample (random), computes a statistic (random), and then estimates the parameter from the statistic (random). By calling the process random, we mean that the final estimate could have been different because the sample could have been different.
A confidence interval is one notion of a margin of error around an estimate. The margin of error acknowledges that the estimation process could have been misleading because of the variation in the sample. It gives a range that might contain the original parameter, along with a percentage of confidence. For example, a 95% confidence interval is one in which we would expect the interval to contain the true parameter for 95% of random samples. For any particular sample, any particular estimate, and any particular interval, the parameter is either contained or not; we never know for sure. The purpose of the confidence interval is to quantify how often our estimation process gives us a range containing the truth.
There are many different ways to compute a confidence interval. The challenge in computing a confidence interval from a single sample is that we don't know the population, and so we don't know how its samples will vary. One way to make progress despite this limited information is to assume that the variability within the sample is similar to the variability within the population.
Resampling from a Sample
Even though we happen to have the whole population of free_trips, let's imagine for a moment that we have only a single 100-trip sample from this population. That sample has 100 durations. Our goal now is to learn what we can about the population using just this sample.
sample=free_trips.sample(n)
sample
Start Station                        End Station                               Duration
Grant Avenue at Columbus Avenue      San Francisco Caltrain 2 (330 Townsend)   877
Embarcadero at Folsom                San Francisco Caltrain (Townsend at 4th)  597
Beale at Market                      Beale at Market                           67
2nd at Folsom                        Washington at Kearny                      714
Clay at Battery                      Embarcadero at Sansome                    510
South Van Ness at Market             Powell Street BART                        494
San Jose Diridon Caltrain Station    MLK Library                               530
Mechanics Plaza (Market at Battery)  Clay at Battery                           286
South Van Ness at Market             Civic Center BART (7th at Market)         172
Broadway St at Battery St            5th at Howard                             551
... (90 rows omitted)
average=np.average(sample.column('Duration'))
average
518.55999999999995
A technique called bootstrap resampling uses the variability of the sample as a proxy for the variability of the population. Instead of drawing many samples from the population, we will draw many samples from the sample. Since the original sample and each resampled sample have the same size, we must draw with replacement in order to see any variability at all.
resamples = Table(['Resample #', 'Duration'])
for i in np.arange(40):
    resample = sample.sample(n, with_replacement=True)
    for trip in resample.rows:
        resamples.append([i, trip.item('Duration')])
Resample# Duration
0 329
0 252
0 252
0 685
0 434
0 396
0 434
0 585
0 394
0 202
...(3990rowsomitted)
Using the resamples, we can investigate the variability of the corresponding sample averages.
resamples.group(0,np.average).scatter(0,1,s=50)
These sample averages are not necessarily clustered around the true average 550. Instead, they will be clustered around the average of the sample from which they were re-sampled.
resamples.group(0,np.average).hist(1,bins=np.arange(480,641,10))
Although the center of this distribution is not the same as the true population average, its variability is similar.
Bootstrap Confidence Interval
A technique called bootstrap resampling estimates a confidence interval using resampling. To find a confidence interval, we must estimate how much the sample statistic deviates from the population parameter. The bootstrap assumes that a resampled statistic will deviate from the sample statistic in approximately the same way that the sample statistic deviates from the parameter.
To carry out the technique, we first compute the deviations from the sample statistic (in this case, the sample average) for many resampled samples. The number of resamples is arbitrary, and it doesn't require collecting any new data; 1000 or more is typical.
deviations = Table(['Resample #', 'Deviation'])
for i in np.arange(1000):
    resample = sample.sample(n, with_replacement=True)
    dev = average - np.average(resample.column(2))
    deviations.append([i, dev])
deviations
Resample# Deviation
0 30.95
1 41.16
2 3.37
3 2.54
4 -22.87
5 27.63
6 49.93
7 11.91
8 26.32
9 -5.97
...(990rowsomitted)
A deviation of zero indicates that the resampled statistic is the same as the sample statistic. That's evidence that the sample statistic is perhaps the same as the population parameter. A positive deviation is evidence that the sample statistic may be greater than the parameter. As we can see, for averages, the distribution of deviations is roughly symmetric.
deviations.hist(1,bins=np.arange(-100,101,10))
A 95% confidence interval is computed based on the middle 95% of this distribution of deviations. The maximum and minimum deviations are estimated by the 97.5th percentile and the 2.5th percentile respectively, because the span between these contains 95% of the distribution.
lower=average-percentile(97.5,deviations.column(1))
lower
461.93000000000001
upper=average-percentile(2.5,deviations.column(1))
upper
576.29999999999995
Now we have not only an estimate, but an interval around it that expresses our uncertainty about the value of the parameter.
print('A 95% confidence interval around', average, 'is', lower, 'to', upper)
A 95% confidence interval around 518.56 is 461.93 to 576.3
And behold, the interval happens to contain the true parameter 550. Normally, after generating a confidence interval, there is no way to check that it is correct, because doing so requires access to the entire population. In this case, we started with the whole population in a single table, and so we are able to check.
The following function encapsulates the entire process of producing a confidence interval for the population average using a single sample. Each time it is run, even for the same sample, the interval will be slightly different because of the random variation of resampling. However, the upper and lower bounds generated by 1000 resamples for this problem are typically quite similar across multiple calls.
def average_ci(sample, label):
    deviations = Table(['Resample #', 'Deviation'])
    n = sample.num_rows
    average = np.average(sample.column(label))
    for i in np.arange(1000):
        resample = sample.sample(n, with_replacement=True)
        dev = np.average(resample.column(label)) - average
        deviations.append([i, dev])
    return (average - percentile(97.5, deviations.column(1)),
            average - percentile(2.5, deviations.column(1)))
average_ci(sample,'Duration')
(457.43999999999994,572.63999999999987)
average_ci(sample,'Duration')
(458.7399999999999,575.50999999999988)
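The same deviation-based recipe can be sketched without the Table library, in plain Python (the skewed synthetic population and the function name here are our own illustration, not the trip data):

```python
import random

random.seed(0)

def bootstrap_average_ci(sample, resamples=1000):
    # 95% confidence interval for a population average, built from the
    # deviations of resampled averages around the sample average.
    n = len(sample)
    average = sum(sample) / n
    devs = []
    for _ in range(resamples):
        resample = random.choices(sample, k=n)  # draw with replacement
        devs.append(sum(resample) / n - average)
    devs.sort()
    # Middle 95% of deviations: 2.5th and 97.5th percentiles
    lo_dev = devs[int(0.025 * resamples)]
    hi_dev = devs[int(0.975 * resamples) - 1]
    return average - hi_dev, average - lo_dev

# A sample of 100 skewed "durations" between 0 and 1800
sample = [random.triangular(0, 1800, 300) for _ in range(100)]
lower, upper = bootstrap_average_ci(sample)
print(lower, upper)
```

The key point mirrors average_ci above: everything is computed from the one sample; no access to the population is needed.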
However, if we draw an entirely new sample, then the confidence interval will change.
average_ci(free_trips.sample(n),'Duration')
(530.18000000000006,661.21000000000004)
The interval is about the same width, and that's typical of multiple samples of the same size drawn from the same population. However, if a much larger sample is drawn, then the resulting confidence interval is typically much narrower. We're not losing confidence in this case; we're just using more evidence to perform inference, and so we have less uncertainty about the full population.
average_ci(free_trips.sample(16*n),'Duration')
(527.18937499999993,555.11437499999988)
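The narrowing effect of a larger sample can be sketched outside the bike-share data. The following simulation (an illustration, not the book's dataset; the exponential population, seed, and all names here are assumptions) bootstraps the mean of a skewed population at two sample sizes. With 16 times as many observations, the interval comes out roughly 4 times (the square root of 16) narrower.

```python
import numpy as np

def bootstrap_ci(sample, reps=1000, rng=None):
    """95% confidence interval for the mean, via the percentile-of-deviations
    method used in the text: bound = sample average minus a percentile of
    the resampled deviations."""
    rng = np.random.default_rng(0) if rng is None else rng
    avg = sample.mean()
    devs = np.array([rng.choice(sample, size=len(sample)).mean() - avg
                     for _ in range(reps)])
    return avg - np.percentile(devs, 97.5), avg - np.percentile(devs, 2.5)

rng = np.random.default_rng(42)
population = rng.exponential(scale=550, size=100_000)  # skewed, like durations

small = rng.choice(population, 100)    # a sample of 100
large = rng.choice(population, 1600)   # a sample 16 times larger

lo_s, hi_s = bootstrap_ci(small, rng=rng)
lo_l, hi_l = bootstrap_ci(large, rng=rng)
print(hi_s - lo_s, hi_l - lo_l)  # the second width is much smaller
```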
Evaluating Confidence Intervals
When given only a sample, it is not possible to check whether the parameter was estimated correctly. However, in the case of the Bay Area Bike Share, we have not only the sample, but many samples and the whole population. Therefore, we can check how often the confidence intervals we compute in this way do in fact contain the true parameter.
Before measuring the accuracy of our technique, we can visualize the result of resampling each sample. The following scatter diagram has 6 vertical lines for 6 samples, and each line consists of 100 points that represent the statistics from resamples for that sample.
samples = Table(['Sample #', 'Resample average', 'Sample Average'])
for i in np.arange(6):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    for j in np.arange(100):
        resample = sample.sample(n, with_replacement=True)
        resample_average = np.average(resample.column('Duration'))
        samples.append([i, resample_average, average])
samples.scatter(0, s=80)
We can see from the plot above that for almost every sample, there are some resampled averages above 550 and some below. When we use these resampled averages to compute confidence intervals, many of those intervals contain 550 as a result.
Below, we draw 100 more samples and show the upper and lower bounds of the confidence intervals computed from each one.
intervals = Table(['Sample #', 'Lower', 'Estimate', 'Upper'])
for i in np.arange(100):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    lower, upper = average_ci(sample, 'Duration')
    intervals.append([i, lower, average, upper])
We can compute precisely which intervals contain the truth, and we should find that about 95 of them do.
correct = np.logical_and(intervals.column('Lower') <= 550, intervals.column('Upper') >= 550)
np.count_nonzero(correct)
97
Each confidence interval is generated from only a single sample of 100 trips. As a result, the width of the interval varies quite a bit as well. Compared to the interval width we computed originally (which required multiple samples), we see that some of these intervals are much narrower and some are much wider.
Table().with_column('Width', intervals.column('Upper') - intervals.column('Lower')).hist()
Given a single sample, it is impossible to know whether you have correctly estimated the parameter. However, you can infer a precise notion of confidence by following a process based on random sampling.
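The coverage property can also be checked on a synthetic population where the true mean is known. The sketch below (the population, seed, and names are assumptions, not the bike-share table) runs a percentile-of-deviations bootstrap on 100 independent samples and counts how many of the resulting intervals capture the true mean; the count should land in the neighborhood of 95.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=550, size=100_000)
true_mean = population.mean()

def bootstrap_ci(sample, reps=500):
    """95% interval for the mean, percentile-of-deviations method."""
    avg = sample.mean()
    devs = np.array([rng.choice(sample, size=len(sample)).mean() - avg
                     for _ in range(reps)])
    return avg - np.percentile(devs, 97.5), avg - np.percentile(devs, 2.5)

hits = 0
for _ in range(100):
    sample = rng.choice(population, 100)
    lo, hi = bootstrap_ci(sample)
    if lo <= true_mean <= hi:
        hits += 1
print(hits, 'of 100 intervals contain the true mean')
```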
Distance Between Distributions
By now we have seen many examples of situations in which one distribution appears to be quite close to another, such as a population and its sample. The goal of this section is to quantify the difference between two distributions. This will allow our analyses to be based on more than the assessments that we are able to make by eye.
For this, we need a measure of the distance between two distributions. We will develop such a measure in the context of a study conducted in 2010 by the American Civil Liberties Union (ACLU) of Northern California.
The focus of the study was the racial composition of jury panels in Alameda County. A jury panel is a group of people chosen to be prospective jurors; the final trial jury is selected from among them. Jury panels can consist of a few dozen people or several thousand, depending on the trial. By law, a jury panel is supposed to be representative of the community in which the trial is taking place. Section 197 of California's Code of Civil Procedure says, "All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court."
The final jury is selected from the panel by deliberate inclusion or exclusion: the law allows potential jurors to be excused for medical reasons; lawyers on both sides may strike a certain number of potential jurors from the list in what are called "peremptory challenges"; the trial judge might make a selection based on questionnaires filled out by the panel; and so on. But the initial panel is supposed to resemble a random sample of the population of eligible jurors.
The ACLU of Northern California compiled data on the racial composition of the jury panels in 11 felony trials in Alameda County in the years 2009 and 2010. In those panels, the total number of people who reported for jury service was 1453. The ACLU gathered demographic data on all of these prospective jurors, and compared those data with the composition of all eligible jurors in the county.
The data are tabulated below in a table called jury. For each race, the first value is the percentage of all eligible juror candidates of that race. The second value is the percentage of people of that race among those who appeared for the process of selection into the jury.
jury = Table(["Race", "Eligible", "Panel"]).with_rows([
    ["Asian",  0.15, 0.26],
    ["Black",  0.18, 0.08],
    ["Latino", 0.12, 0.08],
    ["White",  0.54, 0.54],
    ["Other",  0.01, 0.04],
])
jury.set_format([1, 2], PercentFormatter(0))
Race Eligible Panel
Asian 15% 26%
Black 18% 8%
Latino 12% 8%
White 54% 54%
Other 1% 4%
Some races are overrepresented and some are underrepresented on the jury panels in the study. A bar graph is helpful for visualizing the differences.
jury.barh('Race')
Total Variation Distance
To measure the difference between the two distributions, we will compute a quantity called the total variation distance between them. To compute the total variation distance, take the difference between the two proportions in each category, add up the absolute values of all the differences, and then divide the sum by 2.
Here are the differences and the absolute differences:
jury.append_column("Difference", jury.column("Panel") - jury.column("Eligible"))
jury.append_column("Abs. Difference", np.abs(jury.column("Difference")))
jury.set_format([3, 4], PercentFormatter(0))
Race Eligible Panel Difference Abs. Difference
Asian 15% 26% 11% 11%
Black 18% 8% -10% 10%
Latino 12% 8% -4% 4%
White 54% 54% 0% 0%
Other 1% 4% 3% 3%
And here is the sum of the absolute differences, divided by 2:
sum(jury.column(4)) / 2
0.14000000000000001
The total variation distance between the distribution of eligible jurors and the distribution of the panels is 0.14. Before we examine the numerical value, let us examine the reasons behind the steps in the calculation.
It is quite natural to compute the difference between the proportions in each category. Now take a look at the Difference column and notice that the sum of its entries is 0: the positive entries add up to 0.14, exactly canceling the total of the negative entries, which is -0.14. The proportions in each of the two columns Panel and Eligible add up to 1, and so the give-and-take between their entries must add up to 0.
To avoid the cancellation, we drop the negative signs and then add all the entries. But this gives us two times the total of the positive entries (equivalently, two times the total of the negative entries, with the sign removed). So we divide the sum by 2.
We would have obtained the same result by just adding the positive differences. But our method of including all the absolute differences eliminates the need to keep track of which differences are positive and which are not.
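This equivalence is easy to check numerically. Here is a quick sketch using the proportions from the jury table above (plain numpy arrays, outside the Table machinery):

```python
import numpy as np

eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panel    = np.array([0.26, 0.08, 0.08, 0.54, 0.04])
diff = panel - eligible

half_abs_sum = np.abs(diff).sum() / 2   # the definition: half the sum of |differences|
positive_sum = diff[diff > 0].sum()     # just the positive differences
print(half_abs_sum, positive_sum)       # both are 0.14, up to float rounding
```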
The following function computes the total variation distance between two columns of a table.
def total_variation_distance(column, other):
    return sum(np.abs(column - other)) / 2

def table_tvd(table, label, other):
    return total_variation_distance(table.column(label), table.column(other))

table_tvd(jury, 'Eligible', 'Panel')
0.14000000000000001
Are the panels representative of the population?
We will now turn to the numerical value of the total variation distance between the eligible jurors and the panel. How can we interpret the distance of 0.14? To answer this, let us recall that the panels are supposed to be selected at random. It will therefore be informative to compare the value of 0.14 with the total variation distance between the eligible jurors and a randomly selected panel.
To do this, we will employ our skills at simulation. There were 1453 prospective jurors in the panels in the study. So let us take a random sample of size 1453 from the population of eligible jurors.
[Technical note. Random samples of prospective jurors would be selected without replacement. However, when the size of a sample is small relative to the size of the population, sampling without replacement resembles sampling with replacement; the proportions in the population don't change much between draws. The population of eligible jurors in Alameda County is over a million, and compared to that, a sample size of about 1500 is quite small. We will therefore sample with replacement.]
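The technical note can be illustrated with a quick sketch (the population size, seed, and names here are assumptions): drawing 1453 people from a million-person population without replacement (a hypergeometric count) and with replacement (a binomial count) gives nearly identical sample proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
pop_size, n, p = 1_000_000, 1453, 0.26   # p: proportion in one category

# Without replacement: hypergeometric draw from the finite population.
without = rng.hypergeometric(int(pop_size * p), int(pop_size * (1 - p)), n)
# With replacement: binomial draw, as if the population were infinite.
with_repl = rng.binomial(n, p)

print(without / n, with_repl / n)  # both close to 0.26
```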
The np.random.multinomial function draws a random sample uniformly with replacement from a population whose distribution is categorical. Specifically, np.random.multinomial takes as its input a sample size and an array consisting of the probabilities of choosing different categories. It returns the count of the number of times each category was chosen.
sample_size = 1453
random_panel = np.random.multinomial(sample_size, jury["Eligible"])
random_panel
array([208, 265, 166, 805, 9])
sum(random_panel)
1453
We can compare this distribution with the distribution of eligible jurors by converting the counts to proportions. To do this, we will divide the counts by the sample size. It is clear from the results that the two distributions are quite similar.
sampled = jury.select(['Race', 'Eligible', 'Panel']).with_column(
    'Random Panel', random_panel / sample_size)
sampled.set_format(3, PercentFormatter(0))
Race Eligible Panel Random Panel
Asian 15% 26% 14%
Black 18% 8% 18%
Latino 12% 8% 11%
White 54% 54% 55%
Other 1% 4% 1%
As always, it helps to visualize. The population and sample are quite similar.
sampled.barh('Race', [1, 3])
When we also compare to the panel, the difference is stark.
sampled.barh('Race', [1, 3, 2])
The total variation distance between the population distribution and the sample distribution is quite small.
table_tvd(sampled, 'Eligible', 'Random Panel')
0.016407432897453545
Comparing this to the distance of 0.14 for the panel quantifies our observation that the random sample is close to the distribution of eligible jurors, and the panel is relatively far from the distribution of eligible jurors.
However, the distance between the random sample and the eligible jurors depends on the sample; sampling again might give a different result.
How do random samples differ from the population?
The total variation distance between the distribution of the random sample and the distribution of the eligible jurors is the statistic that we are using to measure the distance between the two distributions. By repeating the process of sampling, we can measure how much the statistic varies across different random samples. The code below computes the empirical distribution of the statistic based on a large number of replications of the sampling process.
# Compute the empirical distribution of TVDs
sample_size = 1453
repetitions = 1000

eligible = jury.column('Eligible')
tvds = Table(["TVD from a random sample"])
for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    tvd = sum(abs(sample - eligible)) / 2
    tvds.append([tvd])
tvds
TVD from a random sample
0.0207571
0.0286098
0.0201721
0.0308672
0.0185134
0.0102546
0.0119133
0.0346731
0.0271576
0.0216862
... (990 rows omitted)
Each row of the column above contains the total variation distance between a random sample and the population of eligible jurors.
The histogram of this column shows that drawing 1453 jurors at random from the pool of eligible candidates results in a distribution that rarely deviates from the eligible jurors' race distribution by more than 0.05.
tvds.hist(bins=np.arange(0, 0.2, 0.005))
How do the panels compare to random samples?
The panels in the study, however, were not quite so similar to the eligible population. The total variation distance between the panels and the population was 0.14, which is far out in the tail of the histogram above. It does not look like a typical distance between a random sample and the eligible population.
Our analysis supports the ACLU's conclusion that the panels were not representative of the population. The ACLU report discusses several reasons for the discrepancies. For example, some minority groups were underrepresented on the records of voter registration and of the Department of Motor Vehicles, the two main sources from which jurors are selected; at the time of the study, the county did not have an effective process for following up on prospective jurors who had been called but had failed to appear; and so on. Whatever the reasons, it seems clear that the composition of the jury panels was different from what we would have expected in a random sample.
A Classical Interlude: the Chi-Squared Statistic
"Do the data look like a random sample?" is a question that has been asked in many contexts for many years. In classical data analysis, the statistic most commonly used in answering this question is called the χ² ("chi-squared") statistic, and is calculated as follows:
Step 1. For each category, define the "expected count" as
expected count = proportion in the population × sample size
This is the count that you would expect in that category, in a randomly chosen sample.
Step 2. For each category, compute
(observed count - expected count)² / expected count
Step 3. Add up all the numbers computed in Step 2.
A little algebra shows that this is equivalent to:
Alternative Method, Step 1. For each category, compute
(sample proportion - population proportion)² / population proportion
Alternative Method, Step 2. Add up all the numbers you computed in the step above, and multiply the sum by the sample size.
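The equivalence of the two methods can be verified numerically with the jury figures (a plain-numpy sketch; the panel proportions are treated as exact):

```python
import numpy as np

n = 1453
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])   # population proportions
panel    = np.array([0.26, 0.08, 0.08, 0.54, 0.04])   # sample proportions

observed = panel * n                                   # observed counts
expected = eligible * n                                # Step 1: expected counts

via_counts = ((observed - expected) ** 2 / expected).sum()    # Steps 2 and 3
via_props  = ((panel - eligible) ** 2 / eligible).sum() * n   # alternative method
print(via_counts, via_props)  # both about 348
```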
It makes sense that the statistic should be based on the difference between proportions in each category, just as the total variation distance is. The remaining steps of the method are not very easy to explain without getting deeper into mathematics.
The reason for choosing this statistic over any other was that statisticians were able to come up with a formula for its approximate probability distribution when the sample size is large. The distribution has an impressive name: it is called the χ² distribution with degrees of freedom equal to 4 (one fewer than the number of categories). Data analysts would compute the χ² statistic for their sample, and then compare it with tabulated numerical values of the χ² distributions.
Simulating the χ² statistic.
The χ² statistic is just a statistic like any other. We can use the computer to simulate its behavior even if we have not studied all the underlying mathematics.
For the panels in the ACLU study, the χ² statistic is about 348:
sum(((jury["Panel"] - jury["Eligible"]) ** 2) / jury["Eligible"]) * sample_size
348.07422222222226
To generate the empirical distribution of the χ² statistic, all we need is a small modification of the code we used for simulating the total variation distance:
# Compute the empirical distribution of the chi-squared statistic
classical_from_sample = Table(["'Chi-squared' statistic, from a random sample"])
for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    chisq = sum(((sample - eligible) ** 2) / eligible) * sample_size
    classical_from_sample.append([chisq])
classical_from_sample
'Chi-squared' statistic, from a random sample
5.29816
1.83597
5.03969
4.3795
10.738
4.74311
4.15047
5.83918
9.6604
0.127756
... (990 rows omitted)
Here is a histogram of the empirical distribution of the χ² statistic. The simulated values of χ² based on random samples are considerably smaller than the value of 348 that we got from the jury panels.
classical_from_sample.hist(bins=np.arange(0, 18, 0.5), normed=True)
In fact, the long-run average value of χ² statistics is equal to the degrees of freedom. And indeed, the average of the simulated values of χ² is close to 4, the degrees of freedom in this case.
np.mean(classical_from_sample.column(0))
3.978631566873136
In this situation, random samples are expected to produce a χ² statistic of about 4. The panels, on the other hand, produced a χ² statistic of 348. This is yet more confirmation of the conclusion that the panels do not resemble a random sample from the population of eligible jurors.
Hypothesis Testing
Our investigation of jury panels is an example of hypothesis testing. We wished to know the answer to a yes-or-no question about the world, and we reasoned about random samples in order to answer the question. In this section, we formalize the process.
Sampling from a Distribution
The rows of a table describe individual elements of a collection of data. In the previous section, we worked with a table that described a distribution over categories. Tables that describe the proportions or counts of different categories in a sample or population arise often in data analysis as a summary of a data collection process. The sample_from_distribution method draws a sample of counts for the categories. Its implementation uses the same np.random.multinomial method from the previous section.
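The core of sample_from_distribution (part of the datascience library) can be sketched with np.random.multinomial alone. The function below is an illustration of that core behavior, not the library's actual implementation, and its name is an assumption:

```python
import numpy as np

def sample_counts(proportions, sample_size, rng=None):
    """Draw one set of category counts from a categorical distribution,
    mimicking what sample_from_distribution does for a table column."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.multinomial(sample_size, proportions)

eligible = [0.15, 0.18, 0.12, 0.54, 0.01]
counts = sample_counts(eligible, 1453, rng=np.random.default_rng(0))
print(counts, counts.sum())  # five counts that add up to 1453
```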
The table below describes the proportion of the eligible potential jurors in Alameda County (estimated from 2000 census data and other sources) for broad categories of race and ethnic background. This table was compiled by Professor Weeks, a demographer at San Diego State University, for the Alameda County trial People v. Stuart Alexander in 2002.
population = Table(["Race", "Eligible"]).with_rows([
    ["Asian",  0.15],
    ["Black",  0.18],
    ["Latino", 0.12],
    ["White",  0.54],
    ["Other",  0.01],
])
population
Race Eligible
Asian 0.15
Black 0.18
Latino 0.12
White 0.54
Other 0.01
The sample_from_distribution method takes a number of samples and a column index or label. It returns a table in which the distribution column has been replaced by category counts of the sample.
sample_size = 1453
population.sample_from_distribution("Eligible", sample_size)
Race Eligible Eligible sample
Asian 0.15 233
Black 0.18 245
Latino 0.12 177
White 0.54 787
Other 0.01 11
Each count is selected randomly in such a way that the chance of each category is the corresponding proportion in the original "Eligible" column. Sampling again will give different counts, but again the "White" category is much more common than "Other" because of its much higher proportion.
population.sample_from_distribution("Eligible", sample_size)
Race Eligible Eligible sample
Asian 0.15 227
Black 0.18 247
Latino 0.12 155
White 0.54 819
Other 0.01 5
The option proportions=True draws sample counts and then divides each count by the total number of samples. The result is another distribution, one based on a sample.
sample = population.sample_from_distribution("Eligible", sample_size, proportions=True)
sample
Race Eligible Eligible sample
Asian 0.15 0.140399
Black 0.18 0.189264
Latino 0.12 0.125258
White 0.54 0.532691
Other 0.01 0.0123882
This sample can also be used to generate a sample. The sample of a sample is called a resampled sample or a bootstrap sample. It is not a random sample from the original distribution, but it shares many characteristics with such a sample.
sample.sample_from_distribution('Eligible sample', sample_size)
Race Eligible Eligible sample Eligible sample sample
Asian 0.15 0.140399 204
Black 0.18 0.189264 258
Latino 0.12 0.125258 163
White 0.54 0.532691 810
Other 0.01 0.0123882 18
Empirical Distributions
An empirical distribution of a statistic is a table generated by computing the statistic for many samples. The previous section used empirical distributions to reason about whether or not a sample had a statistic that was typical of a random sample. The following function computes the empirical distribution of a statistic, which is expressed as a function that takes a sample.
def empirical_distribution(table, label, sample_size, k, f):
    stats = Table(['Sample #', 'Statistic'])
    for i in np.arange(k):
        sample = table.sample_from_distribution(label, sample_size)
        statistic = f(sample)
        stats.append([i, statistic])
    return stats
This function can be used to compute an empirical distribution of the total variation distance from the population distribution.
def tvd_from_eligible(sample):
    counts = sample.column('Eligible sample')
    distribution = counts / sum(counts)
    return total_variation_distance(population.column('Eligible'), distribution)

empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible)
Sample # Statistic
0 0.0248796
1 0.0195664
2 0.00401927
3 0.0165864
4 0.0207502
5 0.013042
6 0.0297041
7 0.0153544
8 0.00830695
9 0.014384
... (990 rows omitted)
The empirical distribution can be visualized using a histogram.
empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible).hist(1)
U.S. Supreme Court, 1965: Swain vs. Alabama
In the early 1960's, in Talladega County in Alabama, a black man called Robert Swain was convicted of raping a white woman and was sentenced to death. He appealed his sentence, citing among other factors the all-white jury. At the time, only men aged 21 or older were allowed to serve on juries in Talladega County. In the county, 26% of the eligible jurors were black, but there were only 8 black men among the 100 selected for the jury panel in Swain's trial. No black man was selected for the trial jury.
In 1965, the Supreme Court of the United States denied Swain's appeal. In its ruling, the Court wrote "... the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of [black men]."
Let us use the methods we have developed to examine the disparity between 8 out of 100 and 26 out of 100 black men in a panel of 100 drawn at random from among the eligible jurors.
alabama = Table(['Race', 'Eligible']).with_rows([
    ["Black", 0.26],
    ["Other", 0.74]
])
alabama.set_format(1, PercentFormatter(0))
Race Eligible
Black 26%
Other 74%
As our test statistic, we will use the number of black men in a random sample of size 100. Here is an empirical distribution of this statistic.
def value_for(category, category_label, value_label):
    def in_table(sample):
        return sample.where(category_label, category).column(value_label)[0]
    return in_table

num_black = value_for('Black', 'Race', 'Eligible sample')
black_in_sample = empirical_distribution(alabama, 'Eligible', 100, 10000, num_black)
black_in_sample
Sample # Statistic
0 27
1 31
2 28
3 27
4 24
5 23
6 29
7 28
8 16
9 25
... (9990 rows omitted)
The numbers of black men in the first 10 repetitions were quite a bit larger than 8. The empirical histogram below shows the distribution of the test statistic over all the repetitions.
black_in_sample.hist(1, bins=np.arange(0, 50, 1))
If the 100 men in Swain's jury panel had been chosen at random, it would have been extremely unlikely for the number of black men on the panel to be as small as 8. We must conclude that the percentage disparity was larger than the disparity expected due to chance variation.
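The simulation's verdict can be cross-checked exactly. Under the null hypothesis, the number of black men on a randomly drawn panel of 100 is a binomial count with n = 100 and p = 0.26, so the chance of 8 or fewer can be computed directly from the binomial formula (a standard-library sketch):

```python
from math import comb

n, p = 100, 0.26
# P(X <= 8): sum the binomial probabilities for k = 0, 1, ..., 8
tail = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(9))
print(tail)  # a tiny probability, far below the 5% and 1% conventions
```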
Method and Terminology of Statistical Tests of Hypotheses
We have developed some of the fundamental concepts of statistical tests of hypotheses in the context of examples about jury selection. Using statistical tests as a way of making decisions is standard in many fields and has a standard terminology. Here is the sequence of steps in most statistical tests, along with some terminology and examples.
STEP 1: THE HYPOTHESES
All statistical tests attempt to choose between two views of how the data were generated. These two views are called hypotheses.
The null hypothesis. This says that the data were generated at random under clearly specified assumptions that make it possible to compute chances. The word "null" reinforces the idea that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance.
In both of our examples about jury selection, the null hypothesis is that the panels were selected at random from the population of eligible jurors. Though the racial composition of the panels was different from that of the populations of eligible jurors, there was no reason for the difference other than chance variation.
The alternative hypothesis. This says that some reason other than chance made the data differ from what was predicted by the null hypothesis. Informally, the alternative hypothesis says that the observed difference is "real."
In both of our examples about jury selection, the alternative hypothesis is that the panels were not selected at random. Something other than chance led to the differences between the racial composition of the panels and the racial composition of the populations of eligible jurors.
STEP 2: THE TEST STATISTIC
In order to decide between the two hypotheses, we must choose a statistic upon which we will base our decision. This is called the test statistic.
In the example about jury panels in Alameda County, the test statistic we used was the total variation distance between the racial distributions in the panels and in the population of eligible jurors. In the example about Swain versus the State of Alabama, the test statistic was the number of black men on the jury panel.
Calculating the observed value of the test statistic is often the first computational step in a statistical test. In the case of jury panels in Alameda County, the observed value of the total variation distance between the distributions in the panels and the population was 0.14. In the example about Swain, the number of black men on his jury panel was 8.
STEP 3: THE PROBABILITY DISTRIBUTION OF THE TEST STATISTIC, UNDER THE NULL HYPOTHESIS
This step sets aside the observed value of the test statistic, and instead focuses on what the value might be if the null hypothesis were true. Under the null hypothesis, the sample could have come out differently due to chance. So the test statistic could have come out differently. This step consists of figuring out all possible values of the test statistic and all their probabilities, under the null hypothesis of randomness.
In other words, in this step we calculate the probability distribution of the test statistic pretending that the null hypothesis is true. For many test statistics, this can be a daunting task both mathematically and computationally. Therefore, we approximate the probability distribution of the test statistic by the empirical distribution of the statistic based on a large number of repetitions of the sampling procedure.
This was the empirical distribution of the test statistic (the number of black men on the jury panel) in the example about Swain and the Supreme Court:
black_in_sample.hist(1, bins=np.arange(0, 50, 1))
STEP 4. THE CONCLUSION OF THE TEST
The choice between the null and alternative hypotheses depends on the comparison between the results of Steps 2 and 3: the observed test statistic and its distribution as predicted by the null hypothesis.
If the two are consistent with each other, then the observed test statistic is in line with what the null hypothesis predicts. In other words, the test does not point towards the alternative hypothesis; the null hypothesis is better supported by the data.
But if the two are not consistent with each other, as is the case in both of our examples about jury panels, then the data do not support the null hypothesis. In both our examples we concluded that the jury panels were not selected at random. Something other than chance affected their composition.
How is "consistent" defined? Whether the observed test statistic is consistent with its predicted distribution under the null hypothesis is a matter of judgment. We recommend that you provide your judgment along with the value of the test statistic and a graph of its predicted distribution. That will allow your reader to make his or her own judgment about whether the two are consistent.
If you do not want to make your own judgment, there are conventions that you can follow. These conventions are based on what is called the observed significance level, or P-value for short. The P-value is a chance computed using the probability distribution of the test statistic, and can be approximated by using the empirical distribution in Step 3.
Practical note on P-values and conventions. Place the observed test statistic on the horizontal axis of the histogram, and find the proportion in the tail starting at that point. That's the P-value.
If the P-value is small, the data support the alternative hypothesis. The conventions for what is "small":
If the P-value is less than 5%, the result is called "statistically significant."
If the P-value is even smaller (less than 1%), the result is called "highly statistically significant."
More formal definition of P-value. The P-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed or is even further in the direction of the alternative.
The P-value is based on comparing the observed test statistic and what the null hypothesis predicts. A small P-value implies that under the null hypothesis, the value of the test statistic is unlikely to be near the one we observed. When a hypothesis and the data are not in accord, the hypothesis has to go. That is why we reject the null hypothesis if the P-value is small.
The P-value for the null hypothesis that Swain's jury panel was selected at random from the population of Talladega is approximated using the empirical distribution as follows:
np.count_nonzero(black_in_sample.column(1) <= 8) / black_in_sample.num_rows
0.0
HISTORICAL NOTE ON THE CONVENTIONS
The determination of statistical significance, as defined above, has become standard in statistical analyses in all fields of application. When a convention is so universally followed, it is interesting to examine how it arose.
The method of statistical testing (choosing between hypotheses based on data in random samples) was developed by Sir Ronald Fisher in the early 20th century. Sir Ronald might have set the convention for statistical significance somewhat unwittingly, in the following statement in his 1925 book Statistical Methods for Research Workers. About the 5% level, he wrote, "It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."
What was "convenient" for Sir Ronald became a cutoff that has acquired the status of a universal constant. No matter that Sir Ronald himself made the point that the value was his personal choice from among many: in an article in 1926, he wrote, "If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the author prefers to set a low standard of significance at the 5 per cent point ..."
Fisher knew that "low" is a matter of judgment and has no unique definition. We suggest that you follow his excellent example. Provide your data, make your judgment, and explain why you made it.
Permutation
Comparing Two Groups
In the examples above, we investigated whether a sample appears to be chosen randomly from an underlying population. We did this by comparing the distribution of the sample with the distribution of the population. A similar line of reasoning can be used to compare the distributions of two samples. In particular, we can investigate whether or not two samples appear to be drawn from the same underlying distribution.
Example: Married Couples and Unmarried Partners
Our next example is based on a study conducted in 2010 under the auspices of the National Center for Family and Marriage Research.
In the United States, the proportion of couples who live together but are not married has been rising in recent decades. The study involved a national random sample of over 1,000 heterosexual couples who were either married or "cohabiting partners," living together but unmarried. One of the goals of the study was to compare the attitudes and experiences of the married and unmarried couples.
The table below shows a subset of the data collected in the study. Each row corresponds to one person. The variables that we will examine in this section are:
Marital Status: married or unmarried
Employment Status: one of several categories described below
Gender
Age: age in years
columns = ['Marital Status', 'Employment Status', 'Gender', 'Age']
couples = Table.read_table('couples.csv').select(columns)
couples
Marital Status Employment Status Gender Age
married working as paid employee male 51
married working as paid employee female 53
married working as paid employee male 57
married working as paid employee female 57
married working as paid employee male 60
married working as paid employee female 57
married working, self-employed male 62
married working as paid employee female 59
married not working - other male 53
married not working - retired female 61
... (2056 rows omitted)
Let us consider just the males first. There are 742 married couples and 292 unmarried couples, and all couples in this study had one male and one female, making 1,034 males in all.
# Separate tables for married and cohabiting unmarried couples:
married_men = couples.where('Gender', 'male').where('Marital Status', 'married')
partnered_men = couples.where('Gender', 'male').where('Marital Status', 'partner')

# Let's see how many married and unmarried people there are:
couples.where('Gender', 'male').group('Marital Status').barh("Marital Status")
Societal norms have changed over the decades, and there has been a gradual acceptance of couples living together without being married. Thus it is natural to expect that unmarried couples will in general consist of younger people than married couples.
The histograms of the ages of the married and unmarried men show that this is indeed the case. We will draw these histograms and compare them. In order to compare two histograms, both should be drawn to the same scale. Let us write a function that does this for us.
def plot_age(table, subject):
    """
    Draws a histogram of the Age column in the given table.

    table should be a Table with a column of people's ages called Age.
    subject should be a string -- the name of the group we're displaying,
    like "married men".
    """
    # Draw a histogram of ages running from 15 years to 70 years
    table.hist('Age', bins=np.arange(15, 71, 5), unit='year')
    # Set the lower and upper bounds of the vertical axis so that
    # the plots we make are all comparable.
    plt.ylim(0, 0.045)
    plt.title("Ages of " + subject)
# Ages of men:
plot_age(married_men, "married men")
plot_age(partnered_men, "cohabiting unmarried men")
The difference is even more marked when we compare the married and unmarried women.
married_women = couples.where('Gender', 'female').where('Marital Status', 'married')
partnered_women = couples.where('Gender', 'female').where('Marital Status', 'partner')

# Ages of women:
plot_age(married_women, "married women")
plot_age(partnered_women, "cohabiting unmarried women")
The histograms show that the married men in the sample are in general older than the unmarried cohabiting men. Married women are in general older than unmarried women. These observations are consistent with what we had predicted based on changing social norms.
If married couples are in general older, they might differ from unmarried couples in other ways as well. Let us compare the employment status of the married and unmarried men in the sample.
The table below shows the marital status and employment status of each man in the sample.
males=couples.where('Gender','male').select(['MaritalStatus','EmploymentStatus'
males.show(5)
Marital Status  Employment Status
married         working as paid employee
married         working as paid employee
married         working as paid employee
married         working, self-employed
married         not working - other
... (1028 rows omitted)
Contingency Tables

To investigate the association between employment and marriage, we would like to be able to ask questions like, "How many married men are retired?"

Recall that the method pivot lets us do exactly that. It cross-classifies each man according to the two variables -- marital status and employment status. Its output is a contingency table that contains the counts in each pair of categories.

employment = males.pivot('Marital Status', 'Employment Status')
employment
Employment Status                               married  partner
not working - disabled                          44       20
not working - looking for work                  28       33
not working - on a temporary layoff from a job  15       8
not working - other                             16       9
not working - retired                           44       4
working as paid employee                        513      170
working, self-employed                          82       47
The arguments of pivot are the labels of the two columns corresponding to the variables we are studying. Categories of the first argument appear as columns; categories of the second argument are the rows. Each cell of the table contains the number of men in a pair of categories -- a particular employment status and a particular marital status.
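What pivot computes can be illustrated with plain Python: count how many rows fall into each pair of categories. A minimal sketch with made-up labels (not the survey data):

```python
from collections import Counter

# Toy rows of (marital status, employment status) pairs
rows = [
    ('married', 'employed'), ('married', 'retired'),
    ('married', 'employed'), ('partner', 'employed'),
    ('partner', 'looking'),
]

# Cross-classify: one count per (condition, value) pair
counts = Counter(rows)

counts[('married', 'employed')]  # 2
counts[('partner', 'retired')]   # 0 (Counter returns 0 for missing pairs)
```

A contingency table is just this collection of counts laid out in a grid, one row per value and one column per condition.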
The table shows that regardless of marital status, the men in the sample are most likely to be working as paid employees. But it is quite hard to compare the entire distributions based on this table, because the numbers of married and unmarried men in the sample are not the same. There are 742 married men but only 291 unmarried ones.

employment.drop(0).sum()

married  partner
742      291
To adjust for this difference in total numbers, we will convert the counts into proportions, by dividing all the married counts by 742 and all the partner counts by 291.

proportions = Table().with_columns([
    "Employment Status", employment.column("Employment Status"),
    "married", employment.column('married') / sum(employment.column('married')),
    "partner", employment.column('partner') / sum(employment.column('partner'))
])
proportions.show()
Employment Status                               married    partner
not working - disabled                          0.0592992  0.0687285
not working - looking for work                  0.0377358  0.113402
not working - on a temporary layoff from a job  0.0202156  0.0274914
not working - other                             0.0215633  0.0309278
not working - retired                           0.0592992  0.0137457
working as paid employee                        0.691375   0.584192
working, self-employed                          0.110512   0.161512
The married column of this table shows the distribution of employment status of the married men in the sample. For example, among married men, the proportion who are retired is about 0.059. The partner column shows the distribution of the employment status of the unmarried men in the sample. Among unmarried men, the proportion who are retired is about 0.014.

The two distributions look different from each other in other ways too, as can be seen more clearly in the bar graphs below. It appears that a larger proportion of the married men in the sample work as paid employees, whereas a larger proportion of the unmarried men are not working but are looking for work.
proportions.barh('Employment Status')

The distributions of employment status of the men in the two groups -- married and unmarried -- are clearly different in the sample.
Are the two distributions different in the population?

This raises the question of whether the difference is due to randomness in the sampling, or whether the distributions of employment status are indeed different for married and unmarried cohabiting men in the U.S. Remember that the data we have are from a sample of just 1,033 couples; we do not know the distribution of employment status of married or unmarried cohabiting men in the entire country.

We can answer the question by performing a statistical test of hypotheses. Let us use the terminology that we developed for this in the previous section.

Null hypothesis. In the United States, the distribution of employment status among married men is the same as among unmarried men who live with their partners.

Another way of saying this is that employment status and marital status are independent, or not associated.

If the null hypothesis were true, then the difference that we have observed in the sample would be just due to chance.

Alternative hypothesis. In the United States, the distributions of the employment status of the two groups of men are different. In other words, employment status and marital status are associated in some way.

As our test statistic, we will use the total variation distance between the two distributions.

The observed value of the test statistic is about 0.15:
# TVD between the two distributions in the sample
married = proportions.column('married')
partner = proportions.column('partner')
observed_tvd = 0.5 * sum(abs(married - partner))
observed_tvd

0.15273571011754242
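The total variation distance can be computed on any pair of distributions represented as arrays of proportions. A minimal NumPy sketch (the two toy distributions below are made up for illustration):

```python
import numpy as np

def total_variation_distance(dist1, dist2):
    """TVD: half the sum of absolute differences between two
    distributions given as arrays of proportions summing to 1."""
    return 0.5 * np.sum(np.abs(dist1 - dist2))

# Two made-up distributions over four categories
a = np.array([0.25, 0.25, 0.25, 0.25])
b = np.array([0.40, 0.30, 0.20, 0.10])

total_variation_distance(a, b)  # 0.5 * (0.15 + 0.05 + 0.05 + 0.15) = 0.2
```

Halving the sum makes the distance lie between 0 (identical distributions) and 1 (no overlap at all).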
Random Permutations

In order to compare this observed value of the total variation distance with what is predicted by the null hypothesis, we need to know how the total variation distance would vary across all possible random samples if employment status and marital status were not related.

This is quite daunting to derive by mathematics, but let us see if we can get a good approximation by simulation.

With just one sample at hand, and no further knowledge of the distribution of employment status among men in the United States, how can we go about replicating the sampling procedure? The key is to note that if marital status and employment status were not connected in any way, then we could replicate the sampling process by replacing each man's employment status by a randomly picked employment status from among all the men, married and unmarried.

Doing this for all the men is equivalent to randomly rearranging the entire column containing employment status, while leaving the marital status column unchanged. Such a rearrangement is called a random permutation.
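Randomly rearranging one column while holding the other fixed can be sketched directly with NumPy (hypothetical toy labels, not the survey data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a fixed condition column and a value column to shuffle
marital = np.array(['married', 'married', 'partner', 'partner'])
employment = np.array(['employed', 'retired', 'employed', 'looking'])

# Randomly rearrange only the employment column;
# the marital column stays in its original order.
shuffled_employment = rng.permutation(employment)

# A permutation contains exactly the same values, in a new order.
sorted(shuffled_employment) == sorted(employment)  # True
```

Under the null hypothesis, pairing the fixed column with the shuffled column is just as plausible as the pairing we actually observed.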
Thus, under the null hypothesis, we can replicate the sampling process by assigning to each man an employment status chosen at random without replacement from the entries in the column Employment Status. We can do the replication by simply permuting the entire Employment Status column and leaving everything else unchanged.

Let's implement this plan. First, we will shuffle the Employment Status column using the sample method, which just shuffles all the rows when provided with no arguments.
# Randomly permute the employment status of all men
shuffled = males.select('Employment Status').sample()
shuffled
Employment Status
working, self-employed
working, self-employed
working as paid employee
working as paid employee
working, self-employed
working as paid employee
not working - disabled
working as paid employee
working as paid employee
working as paid employee
... (1023 rows omitted)
The first two columns of the table below are taken from the original sample. The third has been created by randomly permuting the original Employment Status column.

# Construct a table in which employment status has been shuffled
males_with_shuffled_empl = Table().with_columns([
    "Marital Status", males.column('Marital Status'),
    "Employment Status", males.column('Employment Status'),
    "Employment Status (shuffled)", shuffled.column('Employment Status')
])
males_with_shuffled_empl
Marital Status  Employment Status                               Employment Status (shuffled)
married         working as paid employee                        working, self-employed
married         working as paid employee                        working, self-employed
married         working as paid employee                        working as paid employee
married         working, self-employed                          working as paid employee
married         not working - other                             working, self-employed
married         not working - on a temporary layoff from a job  working as paid employee
married         not working - disabled                          not working - disabled
married         working as paid employee                        working as paid employee
married         working as paid employee                        working as paid employee
married         not working - retired                           working as paid employee
... (1023 rows omitted)
Once again, the pivot method computes the contingency table, which allows us to calculate the total variation distance between the distributions of the two groups of men after their employment status has been shuffled.

employment_shuffled = males_with_shuffled_empl.pivot('Marital Status', 'Employment Status (shuffled)')
employment_shuffled
Employment Status (shuffled)                    married  partner
not working - disabled                          48       16
not working - looking for work                  44       17
not working - on a temporary layoff from a job  16       7
not working - other                             18       7
not working - retired                           39       9
working as paid employee                        489      194
working, self-employed                          88       41
# TVD between the two distributions in the contingency table above
e_s = employment_shuffled
married = e_s.column('married') / sum(e_s.column('married'))
partner = e_s.column('partner') / sum(e_s.column('partner'))
0.5 * sum(abs(married - partner))

0.032423745611841297
This total variation distance was computed based on the null hypothesis that the distributions of employment status for the two groups of men are the same. You can see that it is noticeably smaller than the observed value of the total variation distance (0.15) between the two groups in our original sample.
A Permutation Test

Could this just be due to chance variation? We will only know if we run many more replications, by randomly permuting the Employment Status column repeatedly. This method of testing is known as a permutation test.
# Put it all together in a for loop to perform a permutation test
repetitions = 500
tvds = Table().with_column("TVD between married and partnered men", [])

for i in np.arange(repetitions):
    # Construct a permuted table
    shuffled = males.select('Employment Status').sample()
    combined = Table().with_columns([
        "Marital Status", males.column('Marital Status'),
        "Employment Status", shuffled.column('Employment Status')
    ])
    employment_shuffled = combined.pivot('Marital Status', 'Employment Status')

    # Compute TVD
    e_s = employment_shuffled
    married = e_s.column('married') / sum(e_s.column('married'))
    partner = e_s.column('partner') / sum(e_s.column('partner'))
    permutation_tvd = 0.5 * sum(abs(married - partner))
    tvds.append([permutation_tvd])

tvds.hist(bins=np.arange(0, 0.2, 0.01))
The figure above is the empirical distribution of the total variation distance between the distributions of the employment status of married and unmarried men, under the null hypothesis.
The observed test statistic of 0.15 is quite far in the tail, and so the chance of observing such an extreme value under the null hypothesis is close to 0.

As before, this chance is called an empirical P-value. The P-value is the chance that our test statistic (the TVD) would come out at least as extreme as the observed value (in this case 0.15 or greater) under the null hypothesis.

Conclusion of the test

Our empirical estimate based on repeated sampling gives us all the information we need for drawing conclusions from the data: the observed statistic is very unlikely under the null hypothesis.

The low P-value constitutes evidence in favor of the alternative hypothesis. The data support the hypothesis that in the United States, the distribution of the employment status of married men is not the same as that of unmarried men who live with their partners.

We have just completed our first permutation test. Permutation tests are quite common in practice because they make very few assumptions about the underlying population and are straightforward to perform and interpret.

Note about the approximate P-value

Our simulation gives us an approximate empirical P-value, because it is based on just 500 random permutations instead of all the possible random permutations. We can compute this empirical P-value directly, without drawing the histogram:

empirical_p_value = np.count_nonzero(tvds.column(0) >= observed_tvd) / repetitions
empirical_p_value

0.0

Computing the exact P-value would require us to consider all possible outcomes of shuffling (a very large number) instead of 500 random shuffles. If we had performed all the possible shuffles, there would have been a few with more extreme TVDs. The true P-value is greater than zero, but not by much.
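Because a simulated P-value of exactly 0 understates the true P-value, one common convention (an assumption here, not something done in this text) is to count the observed statistic itself as one additional "permutation," reporting (count + 1) / (repetitions + 1):

```python
import numpy as np

def empirical_p_value(simulated_stats, observed, add_one=True):
    """Proportion of simulated statistics at least as large as the
    observed one. With add_one=True, the observed value counts as one
    extra permutation, so the estimate is never exactly zero."""
    simulated_stats = np.asarray(simulated_stats)
    count = np.count_nonzero(simulated_stats >= observed)
    if add_one:
        return (count + 1) / (len(simulated_stats) + 1)
    return count / len(simulated_stats)

# Made-up simulated TVDs for illustration
sims = np.array([0.02, 0.03, 0.01, 0.05])
empirical_p_value(sims, 0.15)  # (0 + 1) / (4 + 1) = 0.2
```

With 500 real repetitions the corrected estimate would be 1/501, a small positive number consistent with "greater than zero, but not by much."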
Generalizing Our Hypothesis Test
The example above includes a substantial amount of code in order to investigate the relationship between two characteristics (marital status and employment status) for a particular subset of the surveyed population (males). Suppose we would like to investigate different characteristics or a different population. How can we reuse the code we have written so far in order to explore more relationships?

When you are about to copy your code, you should think, "Maybe I should write some functions."

What functions should we write? A good way to make this decision is to think about what you have to compute repeatedly.

In our example, the total variation distance is computed over and over again. So we will begin with a generalized computation of the total variation distance between the distribution of any column of values (such as employment status) when separated into any two conditions (such as marital status) for a collection of data described by any table. Our implementation includes the same statements as we used above, but uses generic names that are specified by the final function call.
# TVD between the distributions of values under any two conditions
def tvd(t, conditions, values):
    """Compute the total variation distance
    between proportions of values under two conditions.

    t (Table) -- a table
    conditions (str) -- a column label in t; should have only two categories
    values (str) -- a column label in t
    """
    e = t.pivot(conditions, values)
    a = e.column(1)
    b = e.column(2)
    return 0.5 * sum(abs(a / sum(a) - b / sum(b)))

tvd(males, 'Marital Status', 'Employment Status')

0.15273571011754242
Next, we can write a function that performs a permutation test, using this tvd function to compute the same statistic on shuffled variants of any table. It's worth reading through this implementation to understand its details.
def permutation_tvd(original, conditions, values):
    """
    Perform a permutation test of whether the distribution of values
    for two conditions is the same in the population, using the total
    variation distance between two distributions as the test statistic.

    original is a Table with two columns. The value of the argument
    conditions is the name of one column, and the value of the argument
    values is the name of the other column. The conditions column should
    have only 2 possible values, corresponding to 2 categories in the
    data.

    The values column is shuffled many times, and the data are grouped
    according to the conditions column. The total variation distance
    between the proportions of values in the 2 categories is computed.
    Then we draw a histogram of all those TV distances. This shows us
    what the TVD between the values of the two distributions would
    typically look like if the values were independent of the conditions.
    """
    repetitions = 500
    stats = []
    for i in np.arange(repetitions):
        shuffled = original.sample()
        combined = Table().with_columns([
            conditions, original.column(conditions),
            values, shuffled.column(values)
        ])
        stats.append(tvd(combined, conditions, values))
    observation = tvd(original, conditions, values)
    p_value = np.count_nonzero(np.array(stats) >= observation) / repetitions
    print("Observation:", observation)
    print("Empirical P-value:", p_value)
    Table([stats], ['Empirical distribution']).hist()

permutation_tvd(males, 'Marital Status', 'Employment Status')

Observation: 0.152735710118
Empirical P-value: 0.0
Now that we have generalized our permutation test, we can apply it to other hypotheses. For example, we can compare the distribution of the employment status of women, grouping them by their marital status. In the case of men we found a difference, but what about with women? First, we can visualize the two distributions.
def compare_bar(t, conditions, values):
    """Bar graphs of distributions of values for each of two conditions."""
    e = t.pivot(conditions, values)
    for label in e.drop(0).labels:
        # Convert each column of counts into proportions
        e.append_column(label, e.column(label) / sum(e.column(label)))
    e.barh(values)

compare_bar(couples.where('Gender', 'female'), 'Marital Status', 'Employment Status')
A glance at the figure shows that the two distributions are different in the sample. The difference in the category "Not working - other" is particularly striking: about 22% of the married women are in this category, compared to only about 8% of the unmarried women. There are several reasons for this difference. For example, the percent of homemakers is greater among married women than among unmarried women, possibly because married women are more likely to be "stay-at-home" mothers of young children. The difference could also be generational: as we saw earlier, the married couples are older than the unmarried partners, and older women are less likely to be in the workforce than younger women.

While we can see that the distributions are different in the sample, we are not really interested in the sample for its own sake. We are examining the sample because it is likely to reflect the population. So, as before, we will use the sample to try to answer a question about something unknown: the distributions of employment status of married and unmarried cohabiting women in the United States. That is the population from which the sample was drawn.

We have to consider the possibility that the observed difference in the sample could simply be the result of chance variation. Remember that our data are only from a random sample of couples. We do not have data for all the couples in the United States.

Null hypothesis: In the U.S., the distribution of employment status is the same for married women as for unmarried women living with their partners. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the distributions of employment status among married and unmarried cohabiting women are different.
Another permutation test to compare distributions

We can test these hypotheses just as we did for men, by using the function permutation_tvd that we defined for this purpose.

permutation_tvd(couples.where('Gender', 'female'), 'Marital Status', 'Employment Status')
Observation: 0.194755513565
Empirical P-value: 0.0

As for the males, the empirical P-value is 0 based on a large number of repetitions. So the exact P-value is close to 0, which is evidence in favor of the alternative hypothesis. The data support the hypothesis that for women in the United States, employment status is associated with whether they are married or unmarried and living with their partners.
Another example

Are gender and employment status independent in the population? We are now in a position to test this quite swiftly:

Null hypothesis. Among married and unmarried cohabiting individuals in the United States, gender is independent of employment status.

Alternative hypothesis. Among married and unmarried cohabiting people in the United States, gender and employment status are related.

permutation_tvd(couples, 'Gender', 'Employment Status')

Observation: 0.185866408519
Empirical P-value: 0.0
The conclusion of the test is that gender and employment status are not independent in the population. This is no surprise; for example, because of societal norms, older women were less likely to have gone into the workforce than men.
Deflategate: Permutation Tests and Quantitative Variables

On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots' footballs had not been inflated as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch.

For several weeks, the world of American football was consumed by accusations, denials, theories, and suspicions: the press labeled the topic Deflategate, after the Watergate political scandal of the 1970's. The National Football League (NFL) commissioned an independent analysis. In this example, we will perform our own analysis of the data.

Pressure is often measured in pounds per square inch (psi). NFL rules stipulate that game balls must be inflated to have pressures in the range 12.5 psi to 13.5 psi. Each team plays with 12 balls. Teams have the responsibility of maintaining the pressure in their own footballs, but game officials inspect the balls. Before the start of the AFC game, all the Patriots' balls were at about 12.5 psi. Most of the Colts' balls were at about 13.0 psi. However, these pre-game data were not recorded.

During the second quarter, the Colts intercepted a Patriots ball. On the sidelines, they measured the pressure of the ball and determined that it was below the 12.5 psi threshold. Promptly, they informed officials.
At half-time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the balls. Here are the data; pressure is measured in psi. The Patriots ball that had been intercepted by the Colts was not inspected at half-time. Nor were most of the Colts' balls -- the officials simply ran out of time and had to relinquish the balls for the start of second-half play.

football = Table.read_table('football.csv')
football.show()
Team  Ball         Blakeman  Prioleau
0     Patriots 1   11.5      11.8
0     Patriots 2   10.85     11.2
0     Patriots 3   11.15     11.5
0     Patriots 4   10.7      11
0     Patriots 5   11.1      11.45
0     Patriots 6   11.6      11.95
0     Patriots 7   11.85     12.3
0     Patriots 8   11.1      11.55
0     Patriots 9   10.95     11.35
0     Patriots 10  10.5      10.9
0     Patriots 11  10.9      11.35
1     Colts 1      12.7      12.35
1     Colts 2      12.75     12.3
1     Colts 3      12.5      12.95
1     Colts 4      12.55     12.15
For each of the 15 balls that were inspected, the two officials got different results. It is not uncommon for repeated measurements on the same object to yield different results, especially when the measurements are performed by different people. So we will assign to each ball the average of the two measurements made on that ball.
football = football.with_column(
    'Combined', (football.column('Blakeman') + football.column('Prioleau')) / 2
)
football.show()
Team  Ball         Blakeman  Prioleau  Combined
0     Patriots 1   11.5      11.8      11.65
0     Patriots 2   10.85     11.2      11.025
0     Patriots 3   11.15     11.5      11.325
0     Patriots 4   10.7      11        10.85
0     Patriots 5   11.1      11.45     11.275
0     Patriots 6   11.6      11.95     11.775
0     Patriots 7   11.85     12.3      12.075
0     Patriots 8   11.1      11.55     11.325
0     Patriots 9   10.95     11.35     11.15
0     Patriots 10  10.5      10.9      10.7
0     Patriots 11  10.9      11.35     11.125
1     Colts 1      12.7      12.35     12.525
1     Colts 2      12.75     12.3      12.525
1     Colts 3      12.5      12.95     12.725
1     Colts 4      12.55     12.15     12.35
At a glance, it seems apparent that the Patriots' footballs were at a lower pressure than the Colts' balls. Because some deflation is normal during the course of a game, the independent analysts decided to calculate the drop in pressure from the start of the game. Recall that the Patriots' balls had all started out at about 12.5 psi, and the Colts' balls at about 13.0 psi. Therefore the drop in pressure for the Patriots' balls was computed as 12.5 minus the pressure at half-time, and the drop in pressure for the Colts' balls was 13.0 minus the pressure at half-time.
football = football.with_column(
    'Drop', np.array([12.5] * 11 + [13.0] * 4) - football.column('Combined')
)
football.show()
Team  Ball         Blakeman  Prioleau  Combined  Drop
0     Patriots 1   11.5      11.8      11.65     0.85
0     Patriots 2   10.85     11.2      11.025    1.475
0     Patriots 3   11.15     11.5      11.325    1.175
0     Patriots 4   10.7      11        10.85     1.65
0     Patriots 5   11.1      11.45     11.275    1.225
0     Patriots 6   11.6      11.95     11.775    0.725
0     Patriots 7   11.85     12.3      12.075    0.425
0     Patriots 8   11.1      11.55     11.325    1.175
0     Patriots 9   10.95     11.35     11.15     1.35
0     Patriots 10  10.5      10.9      10.7      1.8
0     Patriots 11  10.9      11.35     11.125    1.375
1     Colts 1      12.7      12.35     12.525    0.475
1     Colts 2      12.75     12.3      12.525    0.475
1     Colts 3      12.5      12.95     12.725    0.275
1     Colts 4      12.55     12.15     12.35     0.65
It is apparent that the drop was larger, on average, for the Patriots' footballs. But could the difference be just due to chance?

To answer this, we must first examine how chance might enter the analysis. This is not a situation in which there is a random sample of data from a large population. It is also not clear how to create a justifiable abstract chance model, as the balls were all different, inflated by different people, and maintained under different conditions.

One way to introduce chance is to ask whether the drops in pressure of the 11 Patriots balls and the 4 Colts balls resemble a random permutation of the 15 drops. Then the 4 Colts drops would be a simple random sample of all 15 drops. This gives us a null hypothesis that we can test using random permutations.
Null hypothesis. The drops in the pressures of the 4 Colts balls are like a random sample (without replacement) from all 15 drops.
A new test statistic

The data are quantitative, so we cannot compare the two distributions category by category using the total variation distance. If we try to bin the data in order to use the TVD, the choice of bins can have a noticeable effect on the statistic. So instead, we will work with a simple statistic based on means. We will just compare the average drops in the two groups.

The observed difference between the average drops in the two groups was about 0.7335 psi.
patriots = football.where('Team', 0).column('Drop')
colts = football.where('Team', 1).column('Drop')
observed_difference = np.mean(patriots) - np.mean(colts)
observed_difference

0.73352272727272805
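The same comparison can be reproduced with plain NumPy, using the 15 drop values from the table above (a sketch; the datascience Table machinery is not required):

```python
import numpy as np

# Drops in pressure from the half-time measurements above
patriots = np.array([0.85, 1.475, 1.175, 1.65, 1.225, 0.725,
                     0.425, 1.175, 1.35, 1.8, 1.375])
colts = np.array([0.475, 0.475, 0.275, 0.65])

observed_difference = patriots.mean() - colts.mean()
round(observed_difference, 4)  # 0.7335
```

This matches the value computed from the table, since the test statistic depends only on the 15 drop values and the group sizes.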
Now the question becomes: if we took a random permutation of the 15 drops, how likely is it that the difference in the means of the first 11 and the last 4 would be at least as large as the difference observed by the officials?

To answer this, we will randomly permute the 15 drops, assigning the first 11 permuted values to the Patriots and the last 4 to the Colts. Then we will find the difference in the means of the two permuted groups.
drops_shuffled = football.sample().column('Drop')
patriots_shuffled = drops_shuffled[:11]
colts_shuffled = drops_shuffled[11:]
shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)
shuffled_difference

0.010000000000000675
This is different from the observed value we calculated earlier. But to get a better sense of the variability under random sampling we must repeat the process many times. Let us try making 5000 repetitions and drawing a histogram of the 5000 differences between means.
repetitions = 5000
test_stats = []

for i in range(repetitions):
    drops_shuffled = football.sample().column('Drop')
    patriots_shuffled = drops_shuffled[:11]
    colts_shuffled = drops_shuffled[11:]
    shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)
    test_stats.append(shuffled_difference)

observation = np.mean(patriots) - np.mean(colts)
emp_p_value = np.count_nonzero(np.array(test_stats) >= observation) / repetitions

differences = Table().with_column('Difference in Means', test_stats)
differences.hist(bins=20)
print("Observation:", observation)
print("Empirical P-value:", emp_p_value)

Observation: 0.733522727273
Empirical P-value: 0.0028
The observed difference was roughly 0.7335 psi. According to the empirical distribution above, there is a very small chance that a random permutation would yield a difference that large. So the data support the conclusion that the two groups of pressures were not like a random permutation of all 15 pressures.

The independent investigative team analyzed the data in several different ways, taking into account the laws of physics. The final report said,

"[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls."

-- Investigative report commissioned by the NFL regarding the AFC Championship game on January 18, 2015

Our analysis shows an average pressure drop of about 0.73 psi, which is consistent with the official analysis.

The all-important question in the football world was whether the excess drop of pressure in the Patriots' footballs was deliberate. To that question, the data have no answer. If you are curious about the answer given by the investigators, here is the full report.
A/B Testing
Using the Bootstrap Method to Test Hypotheses

We have used random permutations to see whether two samples are drawn from the same underlying distribution. The bootstrap method can also be used in such statistical tests of hypotheses.

The examples in this section are based on data on a random sample of 1,174 pairs of mothers and their newborn infants. The table baby contains the data. Each row represents a mother-baby pair. The variables are:

the baby's birth weight in ounces
the number of gestational days
the mother's age in completed years
the mother's height in inches
the mother's pregnancy weight in pounds
whether or not the mother was a smoker
baby = Table.read_table('baby.csv')
baby
Birth Weight  Gestational Days  Maternal Age  Maternal Height  Maternal Pregnancy Weight  Maternal Smoker
120           284               27            62               100                        False
113           282               33            64               135                        False
128           279               28            64               115                        True
108           282               23            67               125                        True
136           286               25            62               93                         False
138           244               33            62               178                        False
132           245               23            65               140                        False
120           289               25            62               125                        False
143           299               30            66               136                        True
140           351               27            68               120                        False
... (1164 rows omitted)
Suppose we want to compare the mothers who smoke and the mothers who are non-smokers. Do they differ in any way other than smoking? A key variable of interest is the birth weight of their babies. To study this variable, we begin by noting that there are 715 non-smokers among the women in the sample, and 459 smokers.

baby.group('Maternal Smoker')

Maternal Smoker  count
False            715
True             459
The first histogram below displays the distribution of birth weights of the babies of the non-smokers in the sample. The second displays the birth weights of the babies of the smokers.

baby.where('Maternal Smoker', False).hist('Birth Weight', bins=np.arange(40, 181, 5))

baby.where('Maternal Smoker', True).hist('Birth Weight', bins=np.arange(40, 181, 5))
Both distributions are approximately bell shaped and centered near 120 ounces. The distributions are not identical, of course, which raises the question of whether the difference reflects just chance variation or a difference in the distributions in the population.

This question can be answered by a test of the null and alternative hypotheses below. Because we are testing for the equality of two distributions, one for Category A (mothers who don't smoke) and the other for Category B (mothers who smoke), the method is known rather unimaginatively as A/B testing.

Null hypothesis. In the population, the distribution of birth weights of babies is the same for mothers who don't smoke as for mothers who do. The difference in the sample is due to chance.

Alternative hypothesis. The two distributions are different in the population.
Test statistic. Birth weight is a quantitative variable, so it is reasonable to use the difference between means as the test statistic. There are other reasonable test statistics as well; the difference between means is just one simple choice.

The observed difference between the means of the two groups in the sample is about 9.27 ounces.

nonsmokers_mean = np.mean(baby.where('Maternal Smoker', False).column('Birth Weight'))
smokers_mean = np.mean(baby.where('Maternal Smoker', True).column('Birth Weight'))
nonsmokers_mean - smokers_mean

9.2661425720249184
Implementing the bootstrap method

To see whether such a difference could have arisen due to chance under the null hypothesis, we could use a permutation test just as we did in the previous section. An alternative is to use the bootstrap, which we will do here.

Under the null hypothesis, the distributions of birth weights are the same for babies of women who smoke and for babies of those who don't. To see how much the distributions could vary due to chance alone, we will use the bootstrap method and draw new samples at random with replacement from the entire set of birth weights. The only difference between this and the permutation test of the previous section is that random permutations are based on sampling without replacement.
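The distinction between the two resampling schemes can be sketched with NumPy's random generator (a toy array, not the birth-weight data):

```python
import numpy as np

rng = np.random.default_rng(42)
values = np.array([10, 20, 30, 40, 50])

# Permutation test: sample WITHOUT replacement -- a reshuffling,
# so every original value appears exactly once.
permuted = rng.choice(values, size=len(values), replace=False)

# Bootstrap: sample WITH replacement -- some values may repeat,
# and others may be missing from the resample.
bootstrapped = rng.choice(values, size=len(values), replace=True)

sorted(permuted) == sorted(values)  # always True for a permutation
```

Both schemes produce resamples of the same size as the original; only the bootstrap allows repeated values.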
We will perform 10,000 repetitions of the bootstrap process. There are 1,174 babies, so in each repetition of the bootstrap process we draw 1,174 times at random with replacement. Of these 1,174 draws, 715 are assigned to the non-smoking mothers and the remaining 459 to the mothers who smoke.

The code starts by sorting the rows so that the rows corresponding to the 715 non-smokers appear first. This eliminates the need to use conditional expressions to identify the rows corresponding to smokers and non-smokers in each replication of the sampling process.
"""Bootstraptestforthedifferenceinmeanbirthweight
CategoryA:non-smokerCategoryB:smoker"""
sorted_by_smoking=baby.select(['MaternalSmoker','BirthWeight'])
#calculatetheobserveddifferenceinmeans
meanA=np.mean(sorted_by_smoking.column('BirthWeight')[:715])
meanB=np.mean(sorted_by_smoking.column('BirthWeight')[715:])
observed_difference=meanA-meanB
repetitions=10000
diffs=[]
foriinrange(repetitions):
#sampleWITHreplacement,samenumberasoriginalsamplesize
resample=sorted_by_smoking.sample(with_replacement=True)
#Computethedifferenceofthemeansoftheresampledvalues,betweenCategoriesAandB
diff_means=np.mean(resample.column('BirthWeight')[:715])-np
diffs.append([diff_means])
#Displayresults
differences=Table().with_column('diff_in_means',diffs)
differences.hist(bins=20)
plots.xlabel('Approxnulldistributionofdifferenceinmeans')
plots.title('BootstrapA/BTest')
plots.xlim(-4,4)
plots.ylim(0,0.4)
print('Observeddifferenceinmeans:',observed_difference)
Observeddifferenceinmeans:9.26614257202
The figure shows that under the null hypothesis of equal distributions in the population, the bootstrap empirical distribution of the difference between the sample means of the two groups is roughly bell shaped, centered at 0, stretching from about -4 ounces to 4 ounces. The observed difference in the original sample is about 9.27 ounces, which is inconsistent with this distribution. So the conclusion of the test is that in the population, the distributions of birth weights of the babies of non-smokers and smokers are different.
Bootstrap A/B testing

Let us define a function bootstrap_AB_test that performs an A/B test using the bootstrap method and the difference in sample means as the test statistic. The null hypothesis is that the two underlying distributions in the population are equal; the alternative is that they are not.

The arguments of the function are:

- the name of the table that contains the data in the original sample
- the label of the column containing the code 0 for Category A and 1 for Category B
- the label of the column containing the values of the response variable (that is, the variable whose distribution is of interest)
- the number of repetitions of the resampling procedure

The function returns the observed difference in means, the bootstrap empirical distribution of the difference in means, and the bootstrap empirical P-value. Because the alternative simply says that the two underlying distributions are different, the P-value is computed as the proportion of sampled differences that are at least as large in absolute value as the absolute value of the observed difference.
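That P-value arithmetic can be sketched with plain NumPy on a handful of made-up numbers (both the resampled differences and the observed difference below are hypothetical, just to show the computation):

```python
import numpy as np

# Hypothetical resampled differences; in practice these come from a bootstrap loop.
resampled_diffs = np.array([-2.1, 0.3, 1.8, -0.9, 2.4, -1.5, 0.2, -2.6])
observed_difference = 2.0  # also made up, for illustration

# Two-sided empirical P-value: the proportion of resampled differences
# at least as large in absolute value as the observed difference.
empirical_p = np.count_nonzero(
    np.abs(resampled_diffs) >= abs(observed_difference)) / len(resampled_diffs)
print(empirical_p)  # 3 of the 8 differences qualify: 0.375
```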
"""BootstrapA/Btestforthedifferenceinthemeanresponse
AssumesA=0,B=1"""
defbootstrap_AB_test(table,categories,values,repetitions):
#SortthesampletableaccordingtotheA/Bcolumn;
#thenselectonlythecolumnofeffects.
response=table.sort(categories).select(values)
#FindthenumberofentriesinCategoryA.
n_A=table.where(categories,0).num_rows
#Calculatetheobservedvalueoftheteststatistic.
meanA=np.mean(response.column(values)[:n_A])
meanB=np.mean(response.column(values)[n_A:])
observed_difference=meanA-meanB
#Runthebootstrapprocedureandgetalistofresampleddifferencesinmeans
diffs=[]
foriinrange(repetitions):
resample=response.sample(with_replacement=True)
d=np.mean(resample.column(values)[:n_A])-np.mean(resample
diffs.append([d])
#ComputethebootstrapempiricalP-value
diff_array=np.array(diffs)
empirical_p=np.count_nonzero(abs(diff_array)>=abs(observed_difference
#Displayresults
differences=Table().with_column('diff_in_means',diffs)
differences.hist(bins=20,normed=True)
plots.xlabel('Approxnulldistributionofdifferenceinmeans')
plots.title('BootstrapA/BTest')
print("Observeddifferenceinmeans:",observed_difference)
print("BootstrapempiricalP-value:",empirical_p)
We can now use the function bootstrap_AB_test to compare the smokers and non-smokers with respect to several different response variables. The tests show a statistically significant difference between the two groups in birth weight (as shown earlier), gestational days,
maternal age, and maternal pregnancy weight. It comes as no surprise that the two groups do not differ significantly in their mean heights.
bootstrap_AB_test(baby, 'Maternal Smoker', 'Birth Weight', 10000)

Observed difference in means: 9.26614257202
Bootstrap empirical P-value: 0.0
bootstrap_AB_test(baby, 'Maternal Smoker', 'Gestational Days', 10000)

Observed difference in means: 1.97652238829
Bootstrap empirical P-value: 0.0367
bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Age', 10000)

Observed difference in means: 0.80767250179
Bootstrap empirical P-value: 0.0192
bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Pregnancy Weight', 10000)

Observed difference in means: 2.56033030151
Bootstrap empirical P-value: 0.0374
bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Height', 10000)

Observed difference in means: -0.0905891494127
Bootstrap empirical P-value: 0.5411
Regression Inference

Regression: a Review

Earlier in the course, we developed the concepts of correlation and regression as ways to describe relations between quantitative variables. We will now see how these concepts can become powerful tools for inference, when used appropriately.

First, let us briefly review the main ideas of correlation and regression, along with code for calculations.

It is often convenient to measure variables in standard units. When you convert a value to standard units, you are measuring how many standard deviations above average the value is.

The function standard_units takes an array as its argument and returns the array converted to standard units.
def standard_units(x):
    return (x - np.mean(x)) / np.std(x)
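As a quick check on this definition, converting any array to standard units gives it mean 0 and SD 1; a minimal sketch, using a made-up array:

```python
import numpy as np

def standard_units(x):
    return (x - np.mean(x)) / np.std(x)

data = np.array([2.0, 4.0, 6.0, 8.0])  # made-up values, for illustration
su = standard_units(data)
print(np.mean(su), np.std(su))  # mean 0, SD 1 (up to rounding)
```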
The correlation between two variables measures the degree to which a scatter plot of the two variables is clustered around a straight line. It has no units and is a number between -1 and 1. It is calculated as the average of the product of the two variables, when both variables are measured in standard units.

The function correlation takes three arguments – the name of a table and the labels of two columns of the table – and returns the correlation between the two columns.
def correlation(table, x, y):
    x_in_standard_units = standard_units(table.column(x))
    y_in_standard_units = standard_units(table.column(y))
    return np.mean(x_in_standard_units * y_in_standard_units)
Here is the scatter plot of birth weight (the y-variable) versus gestational days (the x-variable) of the babies in the table baby. The plot shows a positive association, and correlation returns a value of just over 0.4.
correlation(baby, 'Gestational Days', 'Birth Weight')

0.40754279338885108

baby.scatter('Gestational Days', 'Birth Weight')
When scatter is called with the option fit_line=True, the regression line is drawn on the scatter plot.

baby.scatter('Gestational Days', 'Birth Weight', fit_line=True)
The regression line is the best among all straight lines that could be used to estimate y based on x, in the sense that it minimizes the mean squared error of estimation.

The slope of the regression line is given by:

    slope = r * (SD of y) / (SD of x)

where r is the correlation between the two variables.

The intercept of the regression line is given by:

    intercept = (mean of y) - slope * (mean of x)

The functions slope and intercept calculate these two quantities. Each takes the same three arguments as correlation: the name of the table and the labels of columns x and y, in that order.
def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))
We can now calculate the slope and the intercept of the regression line drawn above. The slope is about 0.47 ounces per gestational day.

slope(baby, 'Gestational Days', 'Birth Weight')

0.46655687694921522

The intercept is about -10.75 ounces.

intercept(baby, 'Gestational Days', 'Birth Weight')

-10.754138914450252

Thus the equation of the regression line for estimating birth weight based on gestational days is:

    estimated birth weight = 0.47 * (gestational days) - 10.75 ounces

The fitted value at a given value of x is the estimate of y based on that value of x. In other words, the fitted value at a given value of x is the height of the regression line at that x.

The function fitted_value computes this height. Like the functions correlation, slope, and intercept, its arguments include the name of the table and the labels of the x and y columns. But it also requires a fourth argument, which is the value of x at which the estimate will be made.
def fitted_value(table, x, y, given_x):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x + b
The fitted value at 300 gestational days is about 129.2 ounces. For a pregnancy that has a length of 300 gestational days, our estimate for the baby's weight is about 129.2 ounces.

fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)
129.2129241703143
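As a sanity check, the fitted value is just slope × x + intercept; plugging in the slope and intercept printed earlier in this section reproduces the same number:

```python
# Slope and intercept values printed earlier in this section
slope_val = 0.46655687694921522
intercept_val = -10.754138914450252

prediction = slope_val * 300 + intercept_val
print(prediction)  # about 129.2129, matching the fitted value above
```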
The function fit returns an array consisting of the fitted values at all of the values of x in the table.
def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b
As we know, the Table method scatter can be used to draw a scatter plot with a regression line through it. Here is another way, using fit. The function scatter_fit takes the same three arguments as fit and returns a scatter diagram along with the straight line of fitted values.
def scatter_fit(table, x, y):
    plots.scatter(table.column(x), table.column(y), s=15)
    plots.plot(table.column(x), fit(table, x, y), lw=2, color='darkblue')
    plots.xlabel(x)
    plots.ylabel(y)

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

# The next two lines just make the plot easier to read
plots.ylim([0, 200])
None
Assumptions of randomness: a "regression model"

Thus far in this section, our analysis has been purely descriptive. But recall that the data are a random sample of all the births in a system of hospitals. What do the data say about the whole population of births? What could they say about a new birth at one of the hospitals?

Questions of inference may arise if we believe that a scatter plot reflects the underlying relation between the two variables being plotted but does not specify the relation completely. For example, a scatter plot of birth weight versus gestational days shows us the precise relation between the two variables in our sample; but we might wonder whether that relation holds true, or almost true, for all babies in the population from which the sample was drawn, or indeed among babies in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly. What we see are points that are scattered around the line. In each of the points, the signal has been contaminated by random noise. Our inferential goal, therefore, is to separate the signal from the noise.

In greater detail, the regression model specifies that the points in the scatter plot are generated at random as follows.

The relation between x and y is perfectly linear. We cannot see this "true line" but
Tyche can. She is the Goddess of Chance. Tyche creates the scatter plot by taking points on the line and pushing them off the line vertically, either above or below, as follows:

For each x, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0. Tyche creates a point whose horizontal coordinate is x and whose vertical coordinate is "the height of the true line at x, plus the error".

Finally, Tyche erases the true line from the scatter, and shows us just the scatter plot of her points.

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below shows how close the regression line is to the true line. The first panel shows how Tyche generates the scatter plot from the true line; the second shows the scatter plot that we see; the third shows the regression line through the plot; and the fourth shows both the regression line and the true line.

To run the simulation, call the function draw_and_compare with three arguments: the slope of the true line, the intercept of the true line, and the sample size.

Run the simulation a few times, with different values for the slope and intercept of the true line, and varying sample sizes. Because all the points are generated according to the model, you will see that the regression line is a good estimate of the true line if the sample size is moderately large.
def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope * x + true_int) + eps
    tyche = Table([x, y], ['x', 'y'])

    plots.figure(figsize=(6, 16))

    plots.subplot(4, 1, 1)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title('What Tyche draws')

    plots.subplot(4, 1, 2)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.title('What we get to see')

    plots.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plots.xlabel("")
    plots.ylabel("")
    plots.title('Regression line: our estimate of true line')

    plots.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    xlims = np.array([np.min(tyche['x']), np.max(tyche['x'])])
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title("Regression line and true line")

# Tyche's true line,
# the points she creates,
# and our estimate of the true line.
# Arguments: true slope, true intercept, number of points

draw_and_compare(3, -5, 25)
Prediction using Regression

In reality, of course, we are not Tyche, and we will never see the true line. What the simulation shows is that if the regression model looks plausible, and if we have a large sample, then the regression line is a good approximation to the true line.

The scatter diagram of birth weights versus gestational days looks roughly linear. Let us assume that the regression model holds. Suppose now that at the hospital there is a new baby who has 300 gestational days. We can use the regression line to predict the birth weight of this baby.

As we saw earlier, the fitted value at 300 gestational days was about 129.2 ounces. That is our prediction for the birth weight of the new baby.
fit_300 = fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)
fit_300

129.2129241703143
The figure below shows where the prediction lies on the regression line. The red line is at x = 300.
scatter_fit(baby, 'Gestational Days', 'Birth Weight')
plots.scatter(300, fit_300, color='red', s=20)
plots.plot([300, 300], [0, fit_300], color='red', lw=1)
plots.ylim([0, 200])
None
The Variability of the Prediction

We have developed a method for predicting a new baby's birth weight based on the number of gestational days. But as data scientists, we know that the sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction. To see how good our prediction is, we must get a sense of how variable the prediction can be.

One way to do this would be by generating new random samples of points and making a prediction based on each new sample. To generate new samples, we can bootstrap the scatter plot.

Specifically, we can simulate new samples by random sampling with replacement from the original scatter plot, as many times as there are points in the scatter.

Here is the original scatter diagram from the sample, and four replications of the bootstrap resampling procedure. Notice how the resampled scatter plots are in general a little more sparse than the original. That is because some of the original points do not get selected in the samples.
plots.figure(figsize=(8, 18))
plots.subplot(5, 1, 1)
plots.scatter(baby[1], baby[0], s=10)
plots.xlim([150, 400])
plots.title('Original sample')

for i in np.arange(1, 5, 1):
    plots.subplot(5, 1, i+1)
    rep = baby.sample(with_replacement=True)
    plots.scatter(rep[1], rep[0], s=10)
    plots.xlim([150, 400])
    plots.title('Bootstrap sample ' + str(i))
The next step is to fit the regression line to the scatter plot in each replication, and make a prediction based on each line. The figure below shows 10 such lines, and the corresponding predicted birth weight at 300 gestational days.
x = 300
lines = Table(['slope', 'intercept'])

for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

lines['prediction at x='+str(x)] = lines.column('slope')*x + lines.column('intercept')

xlims = np.array([291, 309])
left = xlims[0]*lines[0] + lines[1]
right = xlims[1]*lines[0] + lines[1]
fit_x = x*lines['slope'] + lines['intercept']

for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x, fit_x[i], s=20)
The predictions vary from one line to the next. The table below shows the slope and intercept of each of the 10 lines, along with the prediction.
lines

slope       intercept    prediction at x=300
0.446731 -5.08136 128.938
0.417925 3.65604 129.034
0.395636 8.95328 127.644
0.512511 -22.9981 130.755
0.477034 -13.4303 129.68
0.515773 -24.5881 130.144
0.48213 -15.7025 128.936
0.492106 -18.7225 128.909
0.434989 -1.72409 128.773
0.486076 -16.5506 129.272
Bootstrap Prediction Interval

If we increase the number of repetitions of the resampling process, we can generate an empirical histogram of the predictions. This will allow us to create a prediction interval, using methods like those we used earlier to create bootstrap confidence intervals for numerical parameters.

Let us define a function called bootstrap_prediction to do this. The function takes five arguments:

- the name of the table
- the column labels of the predictor and response variables, in that order
- the value of x at which to make the prediction
- the desired number of bootstrap repetitions

In each repetition, the function bootstraps the original scatter plot and calculates the equation of the regression line through the bootstrapped plot. It then uses this line to find the predicted value of y based on the specified value of x. Specifically, it calls the function fitted_value that we defined earlier in this section, to calculate the slope and intercept of the regression line and then calculate the fitted value at the specified x.
Finally, it collects all of the predicted values into a column of a table, draws the empirical histogram of these values, and prints the interval consisting of the "middle 95%" of the predicted values. It also prints the predicted value based on the regression line through the original scatter plot.
# Bootstrap prediction of variable y at new_x
# Data contained in table; prediction by regression of y based on x
# repetitions = number of bootstrap replications of the original scatter plot

def bootstrap_prediction(table, x, y, new_x, repetitions):

    # For each repetition:
    # Bootstrap the scatter;
    # get the regression prediction at new_x;
    # augment the predictions list
    predictions = []
    for i in range(repetitions):
        bootstrap_sample = table.sample(with_replacement=True)
        predicted_value = fitted_value(bootstrap_sample, x, y, new_x)
        predictions.append(predicted_value)

    # Prediction based on original sample
    original = fitted_value(table, x, y, new_x)

    # Display results
    pred = Table().with_column('Prediction', predictions)
    pred.hist(bins=20)
    plots.xlabel('predictions at x='+str(new_x))
    print('Height of regression line at x='+str(new_x)+':', original)
    print('Approximate 95%-confidence interval:')
    print((pred.percentile(2.5).rows[0][0], pred.percentile(97.5).rows[0][0]))

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 300, 5000)

Height of regression line at x=300: 129.21292417
Approximate 95%-confidence interval:
(127.25683835787581, 131.33042912205815)
The figure above shows a bootstrap empirical histogram of the predicted birth weight of a baby at 300 gestational days, based on 5,000 repetitions of the bootstrap process. The empirical distribution is roughly normal.

An approximate 95% prediction interval has been constructed by taking the "middle 95%" of the predictions, that is, the interval from the 2.5th percentile to the 97.5th percentile of the predictions. The interval ranges from about 127 to about 131. The prediction based on the original sample was about 129, which is close to the center of the interval.
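The "middle 95%" computation itself is simple; here is a minimal sketch using NumPy's percentile function on made-up predictions. (Note that np.percentile interpolates between data values, whereas the Table.percentile method used above picks an actual data value, so endpoints can differ slightly.)

```python
import numpy as np

# Made-up bootstrap predictions, for illustration only
predictions = np.array([127.0, 128.5, 129.2, 130.1, 131.4,
                        128.8, 129.9, 127.6])

# Approximate 95% interval: 2.5th to 97.5th percentile of the predictions
left = np.percentile(predictions, 2.5)
right = np.percentile(predictions, 97.5)
print(left, right)
```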
The figure below shows the histogram of 5,000 bootstrap predictions at 285 gestational days. The prediction based on the original sample is about 122 ounces, and the interval ranges from about 121 ounces to about 123 ounces.
bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 285, 5000)

Height of regression line at x=285: 122.214571016
Approximate 95%-confidence interval:
(121.17047827189757, 123.29506281889765)
Notice that this interval is narrower than the prediction interval at 300 gestational days. Let us investigate the reason for this.

The mean number of gestational days is about 279 days:

np.mean(baby['Gestational Days'])

279.10136286201021

So 285 is nearer to the center of the distribution than 300 is. Typically, the regression lines based on the bootstrap samples are closer to each other near the center of the distribution of the predictor variable. Therefore all of the predicted values are closer together as well. This explains the narrower width of the prediction interval.

You can see this in the figure below, which shows predictions at x = 300 and x = 285 for each of ten bootstrap replications. Typically, the lines are farther apart at x = 300 than at x = 285, and therefore the predictions at x = 300 are more variable.
x1 = 300
x2 = 285
lines = Table(['slope', 'intercept'])

for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

xlims = np.array([260, 310])
left = xlims[0]*lines[0] + lines[1]
right = xlims[1]*lines[0] + lines[1]
fit_x1 = x1*lines['slope'] + lines['intercept']
fit_x2 = x2*lines['slope'] + lines['intercept']

plots.xlim(xlims)
for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x1, fit_x1[i], s=20)
    plots.scatter(x2, fit_x2[i], s=20)
Is the observed slope real?
We have developed a method of predicting the birth weight of a new baby in the hospital system, based on the baby's number of gestational days. Our prediction makes use of the positive association that we observed between birth weight and gestational days.

But what if that observation is spurious? In other words, what if Tyche's true line was flat, and the positive association that we observed was just due to randomness in the points that she generated?

Here is a simulation that illustrates why this question arises. We will once again call the function draw_and_compare, this time requiring Tyche's true line to have slope 0. Our goal is to see whether our regression line shows a slope that is not 0.

Remember that the arguments to the function draw_and_compare are the slope and the intercept of Tyche's true line, and the number of points that she generates.

draw_and_compare(0, 10, 25)
Run the simulation a few times, keeping the slope of Tyche's line 0 each time. You will notice that while the slope of the true line is 0, the slope of the regression line typically is not 0. The regression line sometimes slopes upwards, and sometimes downwards, each time giving us a false impression that the two variables are correlated.

Testing whether the slope of the true line is 0

These simulations show that before we put too much faith in our regression predictions, it is worth checking whether the true line does indeed have a slope that is not 0.

We are in a good position to do this, because we know how to bootstrap the scatter diagram and draw a regression line through each bootstrapped plot. Each of those lines has a slope, so we can simply collect all the slopes and draw their empirical histogram. We can then construct an approximate 95% confidence interval for the slope of the true line, using the bootstrap percentile method that is now so familiar.

If the confidence interval for the true slope does not contain 0, then we can be about 95% confident that the true slope is not 0. If the interval does contain 0, then we cannot conclude that there is a genuine linear association between the variable that we are trying to predict and the variable that we have chosen to use as a predictor. We might therefore want to look for a different predictor variable on which to base our predictions.

The function bootstrap_slope executes our plan. Its arguments are the name of the table and the labels of the predictor and response variables, and the desired number of bootstrap replications. In each replication, the function bootstraps the original scatter plot and calculates the slope of the resulting regression line. It then draws the histogram of all the generated slopes and prints the interval consisting of the "middle 95%" of the slopes.

Notice that the code for bootstrap_slope is almost identical to that for bootstrap_prediction, except that now all we need from each replication is the slope, not a predicted value.
def bootstrap_slope(table, x, y, repetitions):

    # For each repetition:
    # Bootstrap the scatter, get the slope of the regression line,
    # augment the list of generated slopes
    slopes = []
    for i in range(repetitions):
        bootstrap_scatter = table.sample(with_replacement=True)
        slopes.append(slope(bootstrap_scatter, x, y))

    # Slope of the regression line from the original sample
    observed_slope = slope(table, x, y)

    # Display results
    slopes = Table().with_column('slopes', slopes)
    slopes.hist(bins=20)
    print('Slope of regression line:', observed_slope)
    print('Approximate 95%-confidence interval for the true slope:')
    print((slopes.percentile(2.5).rows[0][0], slopes.percentile(97.5).rows[0][0]))
Let us call bootstrap_slope to see whether the true linear relation between birth weight and gestational days has slope 0; in other words, whether Tyche's true line is flat.

bootstrap_slope(baby, 'Gestational Days', 'Birth Weight', 5000)

Slope of regression line: 0.466556876949
Approximate 95%-confidence interval for the true slope:
(0.37981209488027268, 0.55753912103512426)
The approximate 95% confidence interval for the true slope runs from about 0.38 ounces per day to about 0.56 ounces per day. This interval does not contain 0. Therefore we can conclude, with about 95% confidence, that the slope of the true line is not 0.

This conclusion supports our choice of using gestational days to predict birth weight.

Suppose now we try to estimate the birth weight of the baby based on the mother's age. Based on the sample, the slope of the regression line for estimating birth weight based on maternal age is positive, about 0.08 ounces per year:

slope(baby, 'Maternal Age', 'Birth Weight')

0.085007669415825132

But an approximate 95% bootstrap confidence interval for the true slope has a negative left endpoint and a positive right endpoint – in other words, the interval contains 0. Therefore we cannot reject the hypothesis that the slope of the true linear relation between maternal age and baby's birth weight is 0. Based on this observation, it would be unwise to predict birth weight based on the regression model with maternal age as the predictor.

bootstrap_slope(baby, 'Maternal Age', 'Birth Weight', 5000)

Slope of regression line: 0.0850076694158
Approximate 95%-confidence interval for the true slope:
(-0.10015674488514391, 0.26974979944992733)
Using confidence intervals to test hypotheses

Notice how we have used confidence intervals for the true slope to test the hypothesis that the true slope is 0. Constructing a confidence interval for a parameter is one way to test the null hypothesis that the parameter has a specified value. If that value is not in the confidence interval, we can reject the null hypothesis that the parameter has that value. If the value is in the confidence interval, we do not have enough evidence to reject the null hypothesis.

If the sample size is large and if the empirical distribution of the estimate of the parameter is roughly normal, then this method of testing via confidence intervals is justifiable. Using a 95% confidence interval corresponds to using a 5% cutoff for the P-value. The justifications of these statements are rather technical, and we will leave them to a more advanced course. For now, it is enough to note that confidence intervals can be used to test hypotheses in this way.

Indeed, the method is more constructive than other methods of testing hypotheses, because not only does it result in a decision about whether or not to reject the null hypothesis, it also provides an estimate of the parameter. For example, in the example of using gestational days to estimate birth weights, we didn't just conclude that the true slope was not 0 – we also estimated that the true slope is between 0.38 ounces per day and 0.56 ounces per day.
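The decision rule can be sketched in a few lines; the function name is ours, and the rounded interval endpoints are taken from the intervals computed in this section:

```python
# Reject the null hypothesis "parameter == null_value" when null_value
# lies outside the approximate 95% confidence interval.
def rejects_null(interval, null_value):
    left, right = interval
    return not (left <= null_value <= right)

# Gestational days as predictor: interval excludes 0, so reject
print(rejects_null((0.38, 0.56), 0))    # True
# Maternal age as predictor: interval contains 0, so cannot reject
print(rejects_null((-0.10, 0.27), 0))   # False
```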
Words of caution

All of the predictions and tests that we have performed in this section assume that the regression model holds. Specifically, the methods assume that the scatter plot resembles points generated by starting with points that are on a straight line and then pushing them off the line by adding random noise.
If the scatter plot does not look like that, then perhaps the model does not hold for the data. If the model does not hold, then calculations that assume the model to be true are not valid.

Therefore, we must first decide whether the regression model holds for our data, before we start making predictions based on the model or testing hypotheses about parameters of the model. A simple way is to do what we did in this section, which is to draw the scatter diagram of the two variables and see whether it looks roughly linear and evenly spread out around a line. In the next section, we will study some more tools for assessing the model.
Regression Diagnostics

The methods that we have developed for inference in regression are valid if the regression model holds for our data. If the data do not satisfy the assumptions of the model, then it can be hard to justify calculations that assume that the model is good.

In this section we will develop some simple descriptive methods that can help us decide whether our data satisfy the regression model.

We will start with a review of the details of the model, which specifies that the points in the scatter plot are generated at random as follows.

Tyche creates the scatter plot by starting with points that lie perfectly on a straight line and then pushing them off the line vertically, either above or below, as follows:

For each x, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0 and an unknown standard deviation. Tyche creates a point whose horizontal coordinate is x and whose vertical coordinate is "the height of the true line at x, plus the error". If the error is positive, the point is above the line. If the error is negative, the point is below the line.

We get to see the resulting points but not the true line on which the points started out, nor the random errors that Tyche added.
First Diagnostic: The Scatter Plot

It is always helpful to start with a visualization. Continuing the example of estimating birth weight based on gestational days, let us look at a scatter plot of the two variables.

baby.scatter('Gestational Days', 'Birth Weight')
The points do appear to be roughly clustered around a straight line. There is no strikingly noticeable curve in the scatter.

A plot of the regression line through the scatter plot shows that about half the points are above the line and half are below the line, almost symmetrically, supporting the model's assumption that Tyche is adding random errors that are normally distributed with mean 0.

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

Second Diagnostic: A Residual Plot
We cannot see the true line that is the basis of the regression model, but the regression line that we calculate acts as a substitute for it. So also the vertical distances of the points from the regression line act as substitutes for the random errors that Tyche uses to push the points vertically off the true line.

As we have seen earlier, the vertical distances of the points from the regression line are called residuals. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of y and the fitted value of y. For the point (x, y),

    residual = observed y - fitted y

The table baby_regression contains the data, the fitted values, and the residuals:
baby2 = baby.select(['Gestational Days', 'Birth Weight'])
fitted = fit(baby, 'Gestational Days', 'Birth Weight')
residuals = baby.column('Birth Weight') - fitted
baby_regression = baby2.with_columns([
    'Fitted Value', fitted,
    'Residual', residuals
])
baby_regression
Gestational Days    Birth Weight    Fitted Value    Residual
284 120 121.748 -1.74801
282 113 120.815 -7.8149
279 128 119.415 8.58477
282 108 120.815 -12.8149
286 136 122.681 13.3189
244 138 103.086 34.9143
245 132 103.552 28.4477
289 120 124.081 -4.0808
299 143 128.746 14.2536
351 140 153.007 -13.0073
... (1164 rows omitted)
Once again, it helps to visualize. The figure below shows four of the residuals; for each of the four points, the length of the red segment is the residual for that point. There is one such residual for each of the other 1170 points as well.
points = Table().with_columns([
    'Gestational Days', [148, 244, 338, 351],
    'Birth Weight', [116, 138, 102, 140]
])

scatter_fit(baby, 'Gestational Days', 'Birth Weight')
for i in range(4):
    f = fitted_value(baby, 'Gestational Days', 'Birth Weight', points.column(0)[i])
    plots.plot([points.column(0)[i]]*2, [f, points.column(1)[i]], color='red', lw=2)
According to the regression model, Tyche generates errors by drawing at random with replacement from a population of errors that is normally distributed with mean 0. To see whether this looks plausible for our data, we will examine the residuals, which are our substitutes for Tyche's errors.

A residual plot can be drawn by plotting the residuals against x. The function residual_plot does just that. The function regression_diagnostic_plots draws the original scatter plot as well as the residual plot for ease of comparison.
def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table[y] - fitted
    t = Table().with_columns([
        x, table[x],
        'residuals', residuals
    ])
    t.scatter(x, 'residuals', color='r')
    xlims = [min(table[x]), max(table[x])]
    plots.plot(xlims, [0, 0], color='darkblue', lw=2)
    plots.title('Residual Plot')

def regression_diagnostic_plots(table, x, y):
    scatter_fit(table, x, y)
    residual_plot(table, x, y)

regression_diagnostic_plots(baby, 'Gestational Days', 'Birth Weight')
The height of each red point in the plot above is a residual. Notice how the residuals are distributed fairly symmetrically above and below the horizontal line at 0. Notice also that the vertical spread of the plot is fairly even across the most common values of x, in about the 200-300 range of gestational days. In other words, apart from a few outlying points, the plot isn't narrower in some places and wider in others.

All of this is consistent with the model's assumptions that the errors are drawn at random with replacement from a normally distributed population with mean 0. Indeed, a histogram of the residuals is centered at 0 and looks roughly bell shaped.

baby_regression.select('Residual').hist()
Theseobservationsshowusthatthedataarefairlyconsistentwiththeassumptionsoftheregressionmodel.Thereforeitmakessensetodoinferencebasedonthemodel,aswedidintheprevioussection.
UsingtheResidualPlottoDetectNonlinearity¶
Residualplotscanbeusedtodetectwhethertheregressionmodeldoesnothold.Themostseriousdeparturefromthemodelisnon-linearity.
Iftwovariableshaveanon-linearrelation,thenon-linearityissometimesvisibleinthescatterplot.Often,however,itiseasiertospotnon-linearityinaresidualplotthanintheoriginalscatterplot.Thisisusuallybecauseofthescalesofthetwoplots:theresidualplotallowsustozoominontheerrorsandhencemakesiteasiertospotpatterns.
Asanexample,wewillstudyadatasetontheageandlengthofdugongs,whicharemarinemammalsrelatedtomanateesandseacows.Thedataareinatablecalleddugong.Ageismeasuredinyearsandlengthinmeters.Theagesareestimated;mostdugongsdon'tkeeptrackoftheirbirthdays.
dugong = Table.read_table('http://www.statsci.org/data/oz/dugongs.txt')
dugong
Age Length
1 1.8
1.5 1.85
1.5 1.87
1.5 1.77
2.5 2.02
4 2.27
5 2.15
5 2.26
7 2.35
8 2.47
... (17 rows omitted)
Here is a regression of length (the response) on age (the predictor). The correlation between the two variables is about 0.83, which is quite high.
correlation(dugong,'Age','Length')
0.82964745549057139
scatter_fit(dugong, 'Age', 'Length')
High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.
regression_diagnostic_plots(dugong, 'Age', 'Length')
At the low end of ages, the residuals are almost all negative; then they are almost all positive; then negative again at the high end of ages. Such a discernible pattern would have been unlikely had Tyche been picking errors at random with replacement and using them to push points off a straight line. Therefore the regression model does not hold for these data.
As another example, the data set us_women contains the average weights of U.S. women of different heights in the 1970s. Weights are measured in pounds and heights in inches. The correlation is spectacularly high: more than 0.99.
us_women = Table.read_table('us_women.csv')
correlation(us_women, 'height', 'ave weight')
0.99549476778421608
The points seem to hug the regression line pretty closely, but the residual plot flashes a warning sign: it is U-shaped, indicating that the relation between the two variables is not linear.
regression_diagnostic_plots(us_women, 'height', 'ave weight')
These examples demonstrate that even when correlation is high, fitting a straight line might not be the right thing to do. Residual plots help us decide whether or not to use linear regression.
Using the Residual Plot to Detect Heteroscedasticity
Heteroscedasticity is a word that will surely be of interest to those who are preparing for spelling bees. For data scientists, its interest lies in its meaning: "uneven spread."

Recall the table hybrid that contains data on hybrid cars in the U.S. Here is a regression of fuel efficiency on the rate of acceleration. The association is negative: cars that accelerate quickly tend to be less efficient.
regression_diagnostic_plots(hybrid, 'acceleration', 'mpg')
Notice how the residual plot flares out towards the low end of the accelerations. In other words, the variability in the size of the errors is greater for low values of acceleration than for high values. Such a pattern would be unlikely under the regression model, because the model assumes that Tyche picks the errors at random with replacement from the same normally distributed population.

Uneven vertical variation in a residual plot can indicate that the regression model does not hold. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.
The Accuracy of the Regression Estimate
We will end our discussion of regression by pointing out some interesting mathematical facts that lead to a measure of the accuracy of regression. We will leave the proofs to another course, but we will demonstrate the facts by examples. Note that all of the facts hold for all scatter plots, whether or not the regression model is good.

Fact 1. No matter what the shape of the scatter diagram, the residuals have an average of 0.

In all the residual plots above, you have seen the horizontal line at 0 going through the center of the plot. That is a visualization of this fact.

As a numerical example, here is the average of the residuals in the regression of birth weight on gestational days of babies:
np.mean(baby_regression.column('Residual'))
-6.3912532279613784e-15
That doesn't look like 0, but in fact it is a very small number that is 0 apart from rounding error.

Here is the average of the residuals in the regression of the length of dugongs on their age:
dugong_residuals = dugong.column('Length') - fit(dugong, 'Age', 'Length')
np.mean(dugong_residuals)
-2.8783559897689243e-16
Once again the mean of the residuals is 0 apart from rounding error.

This is analogous to the fact that if you take any list of numbers and calculate the list of deviations from average, the average of the deviations is 0.
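The analogy is easy to check directly with NumPy; the numbers below are hypothetical, chosen only for illustration.

```python
import numpy as np

# Any list of numbers; these values are hypothetical.
values = np.array([2.0, 7.0, 7.0, 10.0, 14.0])

# Deviations from the average always average out to 0
# (apart from rounding error), just like regression residuals.
deviations = values - np.mean(values)
print(np.mean(deviations))  # 0.0
```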
Fact 2. No matter what the shape of the scatter diagram, the SD of the fitted values is $|r|$ times the SD of the observed values of $y$.

To understand this result, it is worth noting that the fitted values are all on the regression line, whereas the observed values of $y$ are the heights of all the points in the scatter plot and are more variable.
scatter_fit(baby, 'Gestational Days', 'Birth Weight')
The fitted values of birth weight are, therefore, less variable than the observed values of birth weight. But how much less variable?

The scatter plot shows a correlation of about 0.4:
correlation(baby, 'Gestational Days', 'Birth Weight')
0.40754279338885108
Here is the ratio of the SD of the fitted values to the SD of the observed values of birth weight:
fitted_birth_weight = baby_regression.column('Fitted Value')
observed_birth_weight = baby_regression.column('Birth Weight')
np.std(fitted_birth_weight) / np.std(observed_birth_weight)
0.40754279338885108
The ratio is $r$. Thus the SD of the fitted values is $r$ times the SD of the observed values of $y$.
The example of fuel efficiency and acceleration is one where the correlation is negative:
correlation(hybrid,'acceleration','mpg')
-0.5060703843771186
fitted_mpg = fit(hybrid, 'acceleration', 'mpg')
observed_mpg = hybrid.column('mpg')
np.std(fitted_mpg) / np.std(observed_mpg)
0.5060703843771186
The ratio of the two SDs is $|r|$. Thus the SD of the fitted values is $|r|$ times the SD of the observed values of fuel efficiency.

The relation says that the closer the scatter diagram is to a straight line, the closer the fitted values will be to the observed values of $y$, and therefore the closer the variance of the fitted values will be to the variance of the observed values.

Therefore, the closer the scatter diagram is to a straight line, the closer the ratio of the two SDs will be to 1, and the closer $r$ will be to 1 or $-1$.
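Fact 2 can be checked numerically without the datascience library; a minimal NumPy sketch on hypothetical x and y values, with the least-squares slope written in terms of $r$ as in earlier sections.

```python
import numpy as np

# Hypothetical data; Fact 2 holds for any scatter plot.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0])

r = np.corrcoef(x, y)[0, 1]

# Least-squares regression line, slope written in terms of r.
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = slope * x + intercept

# The SD of the fitted values is |r| times the SD of y:
# these two numbers agree.
print(np.std(fitted) / np.std(y))
print(abs(r))
```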
Fact 3. No matter what the shape of the scatter diagram, the SD of the residuals is $\sqrt{1 - r^2}$ times the SD of the observed values of $y$. In other words, the root mean squared error of regression is $\sqrt{1 - r^2}$ times the SD of $y$.

This fact gives a measure of the accuracy of the regression estimate.
Again, we will demonstrate this by example. In the case of gestational days and birth weights, $r$ is about 0.4 and $\sqrt{1 - r^2}$ is about 0.91:

r = correlation(baby, 'Gestational Days', 'Birth Weight')
np.sqrt(1 - r**2)
0.91318611003278638
residuals = baby_regression.column('Residual')
observed_birth_weight = baby_regression.column('Birth Weight')
np.std(residuals) / np.std(observed_birth_weight)
0.9131861100327866
Thus the SD of the residuals is $\sqrt{1 - r^2}$ times the SD of the observed birth weights.

It will come as no surprise that the same relation is true for the regression of fuel efficiency on acceleration:
r = correlation(hybrid, 'acceleration', 'mpg')
np.sqrt(1 - r**2)
0.862492183185677
residuals = hybrid.column('mpg') - fitted_mpg
observed_mpg = hybrid.column('mpg')
np.std(residuals) / np.std(observed_mpg)
0.862492183185677
Let us examine the extreme cases of the general fact that the SD of the residuals is $\sqrt{1 - r^2}$ times the SD of the observed values of $y$. Remember that the residuals always average out to 0.

If $r = 1$ or $r = -1$, then $\sqrt{1 - r^2} = 0$. Therefore the residuals have an average of 0 and an SD of 0 as well; therefore, the residuals are all equal to 0. This is consistent with the observation that if $r = \pm 1$, the scatter plot is a perfect straight line and there is no error in the regression estimate.

If $r = 0$, then $\sqrt{1 - r^2} = 1$ and the SD of the regression estimate is equal to the SD of $y$. This is consistent with the observation that if $r = 0$ then the regression line is a flat line at the average of $y$, and the root mean squared error of regression is the root mean squared deviation from the average of $y$, which is the SD of $y$.
For every other value of $r$, the rough size of the error of the regression estimate is somewhere between 0 and the SD of $y$.
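Fact 3 can be checked the same way as Fact 2; a minimal NumPy sketch on the same hypothetical data.

```python
import numpy as np

# Hypothetical data; Fact 3 holds for any scatter plot.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 7.0, 5.0])

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
residuals = y - (slope * x + intercept)

# The SD of the residuals is sqrt(1 - r^2) times the SD of y:
# these two numbers agree.
print(np.std(residuals) / np.std(y))
print(np.sqrt(1 - r**2))
```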
Chapter 5

Probability
Conditional Probability

Inspired by Peter Norvig's A Concrete Introduction to Probability
Probability is the study of the observations that are generated by sampling from a known distribution. The statistical methods we have studied so far allow us to reason about the world from data. Probability allows us to reason in the opposite direction: what observations result from known facts about the world?

In practice, we rarely know precisely how random processes in the world work. Even the simplest random process, such as flipping a coin, is not as simple as one might assume. Nonetheless, our ability to reason about what data will result from a known random process is useful in many aspects of data science. Fundamentally, the rules of probability allow us to reason about the consequences of assumptions we make.
Probability Distributions

Probability uses the following vocabulary to describe the relationship between known distributions and their observed outcomes.

- Experiment: An occurrence with an uncertain outcome that we can observe. For example, the result of rolling two dice in sequence.
- Outcome: The result of an experiment; one particular state of the world. For example: the first die comes up 4 and the second comes up 2. This outcome could be summarized as (4, 2).
- Sample Space: The set of all possible outcomes for the experiment. For example, (1, 1), (1, 2), (1, 3), ..., (1, 6), (2, 1), (2, 2), ... (6, 4), (6, 5), (6, 6) are all possible outcomes of rolling two dice in sequence. There are 6 * 6 = 36 different outcomes.
- Event: A subset of possible outcomes that together have some property we are interested in. For example, the event "the two dice sum to 5" is the set of outcomes (1, 4), (2, 3), (3, 2), (4, 1).
- Probability: The proportion of experiments for which the event occurs. For example, the probability that the two dice sum to 5 is 4/36 or 1/9.
- Distribution: The probability of all events.
The important part of this terminology is that the sample space is the set of all outcomes, and an event is a subset of the sample space. For an outcome that is an element of an event, we say that it is an outcome for which that event occurs.
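This vocabulary can be made concrete in a few lines of plain Python, using the two-dice experiment from the examples above.

```python
# Sample space: all 36 outcomes of rolling two dice in sequence.
sample_space = [(first, second)
                for first in range(1, 7)
                for second in range(1, 7)]

# Event: the subset of outcomes where the two dice sum to 5.
sums_to_5 = [outcome for outcome in sample_space if sum(outcome) == 5]

print(len(sample_space))  # 36
print(sums_to_5)          # [(1, 4), (2, 3), (3, 2), (4, 1)]

# Probability: the proportion of outcomes in the event, 4/36 or 1/9.
print(len(sums_to_5) / len(sample_space))
```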
Multinomial Distributions

There are many kinds of distributions, but we will focus on the case where the sample space is a fixed, finite set of mutually exclusive outcomes, called a multinomial distribution.

For example, either the two dice come up (2, 1) or they come up (4, 5) in a single roll (but both cannot occur), and there are exactly 36 outcome possibilities that each occur with equal chance.
two_dice = Table(['First', 'Second', 'Chance'])
for first in np.arange(1, 7):
    for second in np.arange(1, 7):
        two_dice.append([first, second, 1/36])
two_dice.set_format('Chance', PercentFormatter(1))
First Second Chance
1 1 2.8%
1 2 2.8%
1 3 2.8%
1 4 2.8%
1 5 2.8%
1 6 2.8%
2 1 2.8%
2 2 2.8%
2 3 2.8%
2 4 2.8%
... (26 rows omitted)
The outcomes are not always equally likely. In the example of two dice rolled in sequence, there are 36 different outcomes that each have a 1/36 chance of occurring. If two dice are rolled together and only the sum of the result is observed, instead there are 11 outcomes that have various different chances.
two_dice_sums = Table(['Sum', 'Chance']).with_rows([
    [2, 1/36], [3, 2/36], [4, 3/36], [5, 4/36], [6, 5/36], [7, 6/36],
    [12, 1/36], [11, 2/36], [10, 3/36], [9, 4/36], [8, 5/36],
]).sort(0)
two_dice_sums.set_format('Chance', PercentFormatter(1))
Sum Chance
2 2.8%
3 5.6%
4 8.3%
5 11.1%
6 13.9%
7 16.7%
8 13.9%
9 11.1%
10 8.3%
11 5.6%
... (1 rows omitted)
Outcomes and Events

These two different dice distributions are closely related to each other. Let's see how.

In a multinomial distribution, each outcome alone is an event with its own chance (or probability). The chances of all outcomes sum to 1. The probability of any event is the sum of the probabilities of the outcomes in that event. Therefore, by writing down the chance of each outcome, we implicitly describe the probability of every event for the whole distribution.

Let's consider the event that the sum of the dice in two sequential rolls is 5. We can represent that event as the matching rows in the two_dice table.
dice_sums = two_dice.column('First') + two_dice.column('Second')
sum_of_5 = two_dice.where(dice_sums == 5)
sum_of_5
First Second Chance
1 4 2.8%
2 3 2.8%
3 2 2.8%
4 1 2.8%
The probability of the event is the sum of the resulting chances.
sum(sum_of_5.column('Chance'))
0.1111111111111111
If we consistently store the probabilities of individual outcomes in a column called Chance, then we can define the probability of any event using the P function.
def P(event):
    return sum(event.column('Chance'))
P(sum_of_5)
0.1111111111111111
One distribution's events can be another distribution's outcomes, as is the case with the two_dice and two_dice_sums distributions. There are 11 different possible sums in the two_dice table, and these are 11 different mutually exclusive events.
with_sums = two_dice.with_column('Sum', dice_sums)
with_sums
First Second Chance Sum
1 1 2.8% 2
1 2 2.8% 3
1 3 2.8% 4
1 4 2.8% 5
1 5 2.8% 6
1 6 2.8% 7
2 1 2.8% 3
2 2 2.8% 4
2 3 2.8% 5
2 4 2.8% 6
... (26 rows omitted)
Grouping these outcomes by their sums shows how many different outcomes appear in each event.
with_sums.group('Sum')
Sum count
2 1
3 2
4 3
5 4
6 5
7 6
8 5
9 4
10 3
11 2
... (1 rows omitted)
Finally, the chance of each sum event is the total chance of all outcomes that match the sum event. In this way, we can generate the two_dice_sums distribution via addition.
grouped = with_sums.select(['Sum', 'Chance']).group('Sum', sum)
grouped.relabeled(1, 'Chance').set_format('Chance', PercentFormatter(1))
Sum Chance
2 2.8%
3 5.6%
4 8.3%
5 11.1%
6 13.9%
7 16.7%
8 13.9%
9 11.1%
10 8.3%
11 5.6%
... (1 rows omitted)
Both distributions will assign the same probability to certain events. For example, the probability of the event that the sum of the dice is 8 can be computed from either distribution.
P(with_sums.where('Sum',8))
0.1388888888888889
P(two_dice_sums.where('Sum',8))
0.1388888888888889
U.S. Birth Times

Here's an example based on data collected in the world.
More babies in the United States are born in the morning than at night. In 2013, the Centers for Disease Control measured the following proportions of babies born in each hour of the day. The Time column is a description of each hour, and the Hour column is its numeric equivalent.
birth = Table.read_table('birth_time.csv').select(['Time', 'Hour', 'Chance'])
birth.set_format('Chance', PercentFormatter(1)).show()
Time Hour Chance
6a.m. 6 2.9%
7a.m. 7 4.5%
8a.m. 8 6.3%
9a.m. 9 5.0%
10a.m. 10 5.0%
11a.m. 11 5.0%
Noon 12 6.0%
1p.m. 13 5.7%
2p.m. 14 5.1%
3p.m. 15 4.8%
4p.m. 16 4.9%
5p.m. 17 5.0%
6p.m. 18 4.5%
7p.m. 19 4.0%
8p.m. 20 4.0%
9p.m. 21 3.7%
10p.m. 22 3.5%
11p.m. 23 3.3%
Midnight 0 2.9%
1a.m. 1 2.9%
2a.m. 2 2.8%
3a.m. 3 2.7%
4a.m. 4 2.7%
5a.m. 5 2.8%
If we assume that this distribution describes the world correctly, what is the chance that a baby will be born between 8 a.m. and 6 p.m.? It turns out that more than half of all babies are born during "business hours," even though this time interval only covers 5/12 of the day.

business_hours = birth.where('Hour', are.between(8, 18))
business_hours
Time Hour Chance
8a.m. 8 6.3%
9a.m. 9 5.0%
10a.m. 10 5.0%
11a.m. 11 5.0%
Noon 12 6.0%
1p.m. 13 5.7%
2p.m. 14 5.1%
3p.m. 15 4.8%
4p.m. 16 4.9%
5p.m. 17 5.0%
P(business_hours)
0.52800000000000002
On the other hand, births late at night are uncommon. The chance of having a baby between midnight and 6 a.m. is much less than 25%, even though this time interval covers 1/4 of the day.

P(birth.where('Hour', are.between(0, 6)))
0.16799999999999998
Conditional Distributions

A common way to transform one distribution into another is to condition on an event. Conditioning means assuming that the event occurs. The conditional distribution given an event takes the chances of all outcomes for which the event occurs and scales up their chances so that they sum to 1.

To scale up the chances, we divide the chance of each outcome in the event by the probability of the event. For example, if we condition on the event that the sum of two dice is above 8, we arrive at the following conditional distribution.
above_8 = two_dice_sums.where('Sum', are.above(8))
given_8 = above_8.with_column('Chance', above_8.column('Chance') / P(above_8))
given_8
Sum Chance
9 40.0%
10 30.0%
11 20.0%
12 10.0%
A conditional distribution describes the chances under the original distribution, updated to reflect a new piece of information. In this case, we see the chances under the original two_dice_sums distribution, given that the sum is above 8.

The conditioning operation can be expressed by the given function, which takes an event table and returns the corresponding conditional distribution.
def given(event):
    return event.with_column('Chance', event.column('Chance') / P(event))

given(two_dice_sums.where('Sum', are.above(8)))
Sum Chance
9 40.0%
10 30.0%
11 20.0%
12 10.0%
Given that a baby is born in the US during business hours (between 8 a.m. and 6 p.m.), what are the chances that it is born before noon? To answer this question, we first form the conditional distribution given business hours, then compute the probability of the event that the baby is born before noon.
given(business_hours)
Time Hour Chance
8a.m. 8 11.9%
9a.m. 9 9.5%
10a.m. 10 9.5%
11a.m. 11 9.5%
Noon 12 11.4%
1p.m. 13 10.8%
2p.m. 14 9.7%
3p.m. 15 9.1%
4p.m. 16 9.3%
5p.m. 17 9.5%
P(given(business_hours).where('Hour', are.below(12)))
0.40340909090909094
This result is not the same as the chance that a baby is born between 8 a.m. and noon in general.

morning = birth.where('Hour', are.between(8, 12))
P(morning)
0.21300000000000002
Nor is it the chance that a baby is born before noon in general.

P(birth.where('Hour', are.below(12)))
0.45500000000000007
Instead, it is the probability that both the event and the conditioning event are true (i.e., that the birth is in the morning), divided by the probability of the conditioning event (i.e., that the birth is during business hours).
P(morning)/P(business_hours)
0.40340909090909094
The morning event is a joint event that can be described as both during business hours and before noon. A joint event is just an event, but one that is described by the intersection of two other events.

business_hours.where('Hour', are.below(12))
Time Hour Chance
8a.m. 8 6.3%
9a.m. 9 5.0%
10a.m. 10 5.0%
11a.m. 11 5.0%
In general, the probability of an event B (e.g., before noon) conditioned on an event A (e.g., during business hours) is the probability of the joint event of A and B divided by the probability of the conditioning event A.

P(business_hours.where('Hour', are.below(12))) / P(business_hours)
0.40340909090909094
The standard notation for this relationship uses a vertical bar for conditioning and a comma for a joint event:

$$P(B \mid A) = \frac{P(A, B)}{P(A)}$$
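This relationship can be checked against the dice sums; a minimal sketch in plain Python, using exact fractions for the chances in the two_dice_sums distribution.

```python
from fractions import Fraction

# Chance of each sum of two dice, as in the two_dice_sums distribution.
chances = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}

# P(B | A) = P(A, B) / P(A): the chance that the sum is 10,
# given that it is above 8.
p_above_8 = sum(c for s, c in chances.items() if s > 8)
p_sum_10_and_above_8 = chances[10]  # the sum is 10, hence also above 8
print(p_sum_10_and_above_8 / p_above_8)  # 3/10, the 30% computed earlier
```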
Joint Distributions

A joint event is one in which two different events both occur. Similarly, a joint outcome is an outcome that is composed of two different outcomes. For example, the outcomes of the two_dice distribution are each joint outcomes of the first and second dice rolls.
two_dice
First Second Chance
1 1 2.8%
1 2 2.8%
1 3 2.8%
1 4 2.8%
1 5 2.8%
1 6 2.8%
2 1 2.8%
2 2 2.8%
2 3 2.8%
2 4 2.8%
... (26 rows omitted)
Another common way to transform distributions is to form a joint distribution from conditional distributions. The definition of a conditional probability provides a formula for computing a joint probability, simply by multiplying both sides of the equation above by $P(A)$:

$$P(A, B) = P(A) \cdot P(B \mid A)$$

The chance of each outcome of a joint distribution can be computed from the probability of the first outcome and the conditional probability of the second outcome given the first outcome. In the two_dice table above, the chance of any first outcome is 1/6, and the chance of any second outcome given any first outcome is 1/6, so the chance of any joint outcome is 1/36 or 2.8%.
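As a quick check, the multiplication rule reproduces the equal chances in the two_dice table; exact fractions make the arithmetic transparent.

```python
from fractions import Fraction

# Chance of a joint outcome = chance of the first outcome times
# the conditional chance of the second outcome given the first.
p_first = Fraction(1, 6)
p_second_given_first = Fraction(1, 6)
p_joint = p_first * p_second_given_first
print(p_joint)  # 1/36

# The 36 equal joint chances sum to 1, as every distribution must.
print(sum(p_joint for _ in range(36)))  # 1
```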
U.S. Birth Days

The CDC also published that in 2013, 78.25% of babies were born on weekdays (Monday through Friday) and 21.75% of babies were born on weekends (Saturday and Sunday). Notably, this distribution is not 5/7 (71.4%) to 2/7 (28.6%), but instead favors weekdays.

Moreover, the CDC published the conditional distributions of 2013 birth times, given a weekday birth (green) and given a weekend birth (gray). The black line is the distribution of all births.
All of the proportions used to generate this chart appear in the birth_day table.

birth_day = Table.read_table('birth_time.csv').drop('Chance')
birth_day.set_format([2, 3], PercentFormatter(1))
Time Hour Weekday Weekend
6a.m. 6 2.7% 3.6%
7a.m. 7 4.7% 3.8%
8a.m. 8 6.7% 4.6%
9a.m. 9 5.1% 5.0%
10a.m. 10 5.0% 5.0%
11a.m. 11 5.0% 4.9%
Noon 12 6.3% 5.0%
1p.m. 13 5.9% 4.7%
2p.m. 14 5.3% 4.6%
3p.m. 15 4.9% 4.6%
... (14 rows omitted)
The Weekday and Weekend columns are the chances of two different conditional distributions.
weekday = birth_day.select(['Hour', 'Weekday']).relabeled(1, 'Chance')
weekend = birth_day.select(['Hour', 'Weekend']).relabeled(1, 'Chance')

The joint distribution is created by multiplying each conditional chance column by the corresponding probability of the conditioning event. In this case, we multiply weekday chances by the chance that the baby was born on a weekday (78.25%); likewise, we multiply weekend chances by the chance that the baby was born on a weekend (21.75%).
birth_joint = Table(['Day', 'Hour', 'Chance'])
for row in weekday.rows:
    birth_joint.append(['Weekday', row.item('Hour'), row.item('Chance') * 0.7825])
for row in weekend.rows:
    birth_joint.append(['Weekend', row.item('Hour'), row.item('Chance') * 0.2175])
birth_joint.set_format('Chance', PercentFormatter(1))
Day Hour Chance
Weekday 6 2.1%
Weekday 7 3.7%
Weekday 8 5.2%
Weekday 9 4.0%
Weekday 10 3.9%
Weekday 11 3.9%
Weekday 12 4.9%
Weekday 13 4.6%
Weekday 14 4.1%
Weekday 15 3.8%
... (38 rows omitted)
This joint distribution has 48 joint outcomes, and the total chance sums to 1.
P(birth_joint)
1.0
Now, we can compute the probability of any event that involves days and hours. For example, the chance of a weekday morning birth is quite high.

P(birth_joint.where('Day', 'Weekday').where('Hour', are.between(8, 12)))
0.17058499999999999
We have seen that less than 22% of babies are born on weekends. However, if we know that a baby was born between 5 a.m. and 6 a.m., then the chances that it is a weekend baby are quite a bit higher than 22%.

early_morning = birth_joint.where('Hour', 5)
early_morning
Day Hour Chance
Weekday 5 2.0%
Weekend 5 0.8%
P(given(early_morning).where('Day', 'Weekend'))
0.28584466551063248
Bayes' Rule

In the final example above, we began with a distribution over weekday and weekend births, $P(\text{Day})$, along with conditional distributions over hours, $P(\text{Hour} \mid \text{Day})$. We were able to compute the probability of a weekend birth conditioned on an hour, an outcome in the distribution $P(\text{Day} \mid \text{Hour})$. Bayes' rule is the formula that expresses the relationship between all of these quantities:

$$P(A \mid B) = \frac{P(A) \cdot P(B \mid A)}{P(B)}$$
The numerator on the right-hand side is just the joint probability of A and B. Bayes' rule writes this probability in its expanded form because so often we are given these two components and must form the joint distribution through multiplication.

The denominator on the right-hand side is the probability of an event that can be computed from the joint distribution. For example, the early_morning event in the example above is an event just about hours, but it includes all joint outcomes where the correct hour occurs.
Each probability in this equation has a name. The names are derived from the most typical application of Bayes' rule, which is to update one's beliefs based on new evidence.

- $P(A)$ is called the prior; it is the probability of event A before any evidence is observed.
- $P(B \mid A)$ is called the likelihood; it is the conditional probability of the evidence event B given the event A.
- $P(B)$ is called the evidence; it is the probability of the evidence event B for any outcome.
- $P(A \mid B)$ is called the posterior; it is the probability of event A after evidence event B is observed.
Depending on the joint distribution of A and B, observing some B can make A more or less likely.

Diagnostic Example
In a population, there is a rare disease. Researchers have developed a medical test for the disease. Mostly, the test correctly identifies whether or not the tested person has the disease. But sometimes, the test is wrong. Here are the relevant proportions.

- 1% of the population has the disease.
- If a person has the disease, the test returns the correct result with chance 99%.
- If a person does not have the disease, the test returns the correct result with chance 99.5%.

One person is picked at random from the population. Given that the person tests positive, what is the chance that the person has the disease?

We begin by partitioning the population into four categories in the tree diagram below.
By Bayes' Rule, the chance that the person has the disease given that he or she has tested positive is the chance of the top "Test Positive" branch relative to the total chance of the two "Test Positive" branches. The answer is

# The person is picked at random from the population.
# By Bayes' Rule:
# Chance that the person has the disease, given that the test was +
(0.01 * 0.99) / (0.01 * 0.99 + 0.99 * 0.005)
0.6666666666666666
This is 2/3, and it seems rather small. The test has very high accuracy, 99% or higher. Yet is our answer saying that if a patient gets tested and the test result is positive, there is only a 2/3 chance that he or she has the disease?

To understand our answer, it is important to recall the chance model: our calculation is valid for a person picked at random from the population. Among all the people in the population, the people who test positive split into two groups: those who have the disease and test positive, and those who don't have the disease and test positive. The latter group is called the group of false positives. The proportion of true positives is twice as high as that of
the false positives (0.0099 compared to 0.00495), and hence the chance of a true positive given a positive test result is 2/3. The chance is affected both by the accuracy of the test and also by the probability that the sampled person has the disease.

The same result can be computed using a table. Below, we begin with the joint distribution over true health and test results.
rare = Table(['Health', 'Test', 'Chance']).with_rows([
    ['Diseased', 'Positive', 0.01 * 0.99],
    ['Diseased', 'Negative', 0.01 * 0.01],
    ['Not Diseased', 'Positive', 0.99 * 0.005],
    ['Not Diseased', 'Negative', 0.99 * 0.995]
])
rare
Health Test Chance
Diseased Positive 0.0099
Diseased Negative 0.0001
Not Diseased Positive 0.00495
Not Diseased Negative 0.98505
The chance that a person selected at random is diseased, given that they tested positive, is computed from the following expression.

positive = rare.where('Test', 'Positive')
P(given(positive).where('Health', 'Diseased'))
0.66666666666666663
But a patient who goes to get tested for a disease is not well modeled as a random member of the population. People get tested because they think they might have the disease, or because their doctor thinks so. In such a case, saying that their chance of having the disease is 1% is not justified; they are not picked at random from the population.

So, while our answer is correct for a random member of the population, it does not answer the question for a person who has walked into a doctor's office to be tested.
To answer the question for such a person, we must first ask ourselves what the probability is that the person has the disease. It is natural to think that it is larger than 1%, as the person has some reason to believe that he or she might have the disease. But how much larger?

This cannot be decided based on relative frequencies. The probability that a particular individual has the disease has to be based on a subjective opinion, and is therefore called a subjective probability. Some researchers insist that all probabilities must be relative frequencies, but subjective probabilities abound. The chance that a candidate wins the next election, the chance that a big earthquake will hit the Bay Area in the next decade, the chance that a particular country wins the next soccer World Cup: none of these are based on relative frequencies or long-run frequencies. Each one contains a subjective element.

It is fine to work with subjective probabilities as long as you keep in mind that there will be a subjective element in your answer. Be aware also that different people can have different subjective probabilities of the same event. For example, the patient's subjective probability that he or she has the disease could be quite different from the doctor's subjective probability of the same event. Here we will work from the patient's point of view.

Suppose the patient assigned a number to his/her degree of uncertainty about whether he/she had the disease, before seeing the test result. This number is the patient's subjective prior probability of having the disease.

If that probability were 10%, then the probabilities on the left side of the tree diagram would change accordingly, with the 0.1 and 0.9 now interpreted as subjective probabilities:

The change has a noticeable effect on the answer, as you can see by running the cell below.
# Subjective prior probability of 10% that the person has the disease
# By Bayes' Rule:
# Chance that the person has the disease, given that the test was +
(0.1 * 0.99) / (0.1 * 0.99 + 0.9 * 0.005)
0.9565217391304347
If the patient's prior probability of having the disease is 10%, then after a positive test result the patient must update that probability to over 95%. This updated probability is called a posterior probability. It is calculated after learning the test result.

If the patient's prior probability of having the disease is 50%, then the result changes yet again.
# Subjective prior probability of 50% that the person has the disease
# By Bayes' Rule:
# Chance that the person has the disease, given that the test was +
(0.5 * 0.99) / (0.5 * 0.99 + 0.5 * 0.005)
0.9949748743718593
Starting out with a 50-50 subjective chance of having the disease, the patient must update that chance to about 99.5% after getting a positive test result.

Computational Note. In the calculation above, the factor of 0.5 is common to all the terms and cancels out. Hence arithmetically it is the same as a calculation where the prior probabilities are apparently missing:

0.99 / (0.99 + 0.005)
0.9949748743718593
But in fact, they are not missing. They have just canceled out. When the prior probabilities are not all equal, they are all visible in the calculation, as we have seen earlier.
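The three calculations above differ only in the prior, so they can be collected into a single function; a minimal sketch (the function name is ours) using this section's test accuracies.

```python
def posterior_given_positive(prior):
    """Chance of having the disease given a positive test, by Bayes' Rule.

    Uses this section's accuracies: a diseased person tests positive
    with chance 0.99; a healthy person tests positive with chance 0.005.
    """
    true_positive = prior * 0.99
    false_positive = (1 - prior) * 0.005
    return true_positive / (true_positive + false_positive)

# The three priors considered in this section.
print(posterior_given_positive(0.01))  # about 0.667
print(posterior_given_positive(0.1))   # about 0.957
print(posterior_given_positive(0.5))   # about 0.995
```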
In machine learning applications such as spam detection, Bayes' Rule is used to update probabilities of messages being spam, based on new messages being labeled Spam or Not Spam. You will need more advanced mathematics to carry out all the calculations. But the fundamental method is the same as what you have seen in this section.