The Curious Journalist's Guide to Data


Upload: ivanjuarez

Post on 12-Jul-2016


DESCRIPTION

For journalists who want to know more about organizing and presenting data.

TRANSCRIPT

Table of Contents

0 Introduction
1 Dedication
2 Introduction
3 Quantification
3.1 The Quantities of Everyday Language
3.2 Counting Race
3.3 The Problem of What to Count
3.4 Sampling and Quantified Error
3.5 The Problem of Measurement Error
3.6 Quantification Is Representation
4 Analysis
4.1 Did the Policy Work?
4.2 Accounting for Chance
4.3 Counting Possible Worlds
4.4 Arguing From the Odds
4.5 Statistical Inference
4.6 What Would Have Happened Anyway?
4.7 Causal Models
4.8 Truth by Elimination
5 Communication
5.1 Perception
5.2 Representation
5.3 Examples Trump Statistics
5.4 Who Is in the Data?
5.5 Communicating Uncertainty
5.6 Prediction
6 Going Further
7 Footnotes
8 Citations


Dedication

For every journalist who has ever thought they’re bad at math. What if you’re wrong?

Acknowledgments

Thank you to the Tow Center for Digital Journalism for the fellowship that supported the writing of this work. I could not have done this otherwise. I’m indebted to Mark Hansen for reading not one but two long drafts and providing expansive feedback. Andrew Gelman kindly reviewed the “Analysis” chapter and really shaped my thinking on causation. Kenneth Prewitt read the material on census and race with an expert eye; any remaining blunders are my own. I’m indebted to research directors Taylor Owen and Claire Wardle for their patient efforts as they shepherded me through the process over nearly two years. I’m deeply grateful to Emily Bell for her support over the years, and the fantastic opportunity to teach at Columbia. My warmest shout-out to the students of my Frontiers of Computational Journalism course, who taught me what it is to teach, and sometimes schooled me with their own work. You’ve been more influential than you know. And thank you to Sara for helping me find the book’s title.

March 2016


Introduction

This is a book about using data in journalism, but it’s not a particularly practical book. Instead it’s for the curious, for those who wonder about the deep ideas that hold everything together. Some of these ideas are very old, some have emerged in just the last few decades, and many of them have come together to create the particularly twenty-first-century practice of data journalism.

We’ll cover some of the mathy parts of statistics, but also the difficulty of taking a census of race and the cognitive psychology of probabilities. We’ll trace where data comes from, what journalists do with it, and where it goes after, and try to understand the possibilities and limitations. Data journalism is as interdisciplinary as it gets, which can make it difficult to assemble all the pieces you need. This is one attempt.

There are few equations and no code in this book, and I don’t assume you know anything about math. But I am assuming you want to know, so I’m going to develop some key ideas from the ground up. Or maybe you’ve studied a technical field and you are just coming into journalism, in which case I hope this book helps you understand how your skills apply. This is a framework, a collection of big ideas journalists can steal from other fields. I want to give a foothold into statistical analysis in all its nerdy splendor, but equally show how ethnography can help you interpret crime figures.

We’re going to look at data a lot more closely than you might be used to. Consider this graph of the U.S. unemployment rate over the last 10 years. There is a whole world just beneath the surface of this image.


From the U.S. Bureau of Labor Statistics.

It’s clear that a lot of people lost their jobs after the 2008 financial crash. You can read this chart and say how many: The unemployment rate went up by about 5 percentage points. This is a very ordinary, very reasonable way of talking about this data, exactly the sort of thing that should pop into your head when you see this image. We’re going to look deeper.

Where did these numbers come from? What do they actually count? What can the journalist say about this data, in light of recent history? What should the audience do after seeing it? Why do we believe charts like this, and should we? How is an unemployment chart any better, or different, than just asking people about their post-crash lives?

What’s the data really doing for us here?

This book is about bringing the quantitative tradition into journalism. Data is not just numbers, but numbers were the first form of data. The very first writing systems were used for accounting, long before they were sophisticated enough for language.[1] At that time the rules of addition must have seemed incredibly arcane (in base 60, at first!), and it must have been a powerful trick to be able to tell in advance how many stones you would need for a building of a certain size. There is no doubt that numbers, like words, are a type of practical magic, and counting is the foundation of data work to this day. But you already know how to count. So we’re mostly going to talk about ideas that were developed during the Enlightenment, then massively refined and expanded in the twentieth century with modern statistics and computers.


We’ll need to go well outside of statistics to make any sense of things. I’ve been raiding psychology and social science and ethnography, and further places too like intelligence analysis and the neurobiology of vision. I’ve been collecting pieces, hoping to use data more thoughtfully and effectively in my journalism work. I’ve tried to organize the things that can be said into three parts: Quantification is what makes data, then the journalist analyzes it, then the result is communicated to the audience. This process creates “stories,” the central products of journalism.

In journalism, a story is a narrative that is not only true but interesting and relevant to the intended audience. Data journalism is different from pure statistical analysis (if there is such a thing) because we need culture, law, and politics to tell us what data matters and how. A procurement database may tell us that the city councilor has been handing out lucrative contracts to his brother. But this is interesting only if we understand this sort of thing as “corruption” and we’ve decided to look for it. A sports journalist might look for entirely different stories in the same data, such as whether or not the city is actually going to build that proposed new stadium. The data alone doesn’t determine the story. But the story still has to be true, and hopefully also thorough and fair. What exactly that means isn’t always obvious. The relationship between story, data, culture, and truth is one of the key problems of twenty-first-century journalism.[i]

The process of quantification, analysis, and communication is a cycle. After communicating a result you may realize that you want a different analysis of the same data, or different data entirely. You might end up repeating this process many times before anything is ever published, exploring the data and communicating primarily to yourself and your colleagues to find and shape the story. Or these steps might happen for each of many stories in a long series, with feedback from the audience directing the course of your reporting. And somewhere, at some point, the audience acts on what you have communicated. Otherwise, journalism would have no effect at all.

Data begins with quantification. Data is not something that exists in nature, and unemployed people are a very different thing than unemployment data. What is counted and how?

There are at least six different ways that the U.S. government counts who is unemployed, which give rise to data sets labeled U1 to U6.[2] The official unemployment rate (the government calls one of them “official”) is known as U3. But U3 does not count people who gave up looking for a job, as U4 does, or people who hold part-time jobs because they can’t get a full-time job, as U6 does.
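To see how much the choice of definition matters, here is a minimal sketch that applies three counting rules to the same group of people. Everything in it is an assumption made for illustration: the headcounts are invented, and the functions `u3_like`, `u4_like`, and `u6_like` follow only the loose descriptions given above, not the full official definitions, which contain more detail.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical headcounts in thousands; these are not real BLS figures."""
    full_time: int
    part_time_by_choice: int
    part_time_wants_full_time: int  # part-time only because no full-time job is available
    jobless_looking: int            # no job, actively searching
    jobless_gave_up: int            # no job, stopped searching


def employed(c: Counts) -> int:
    return c.full_time + c.part_time_by_choice + c.part_time_wants_full_time


def u3_like(c: Counts) -> float:
    # Only jobless people who are still looking count as unemployed.
    return c.jobless_looking / (employed(c) + c.jobless_looking)


def u4_like(c: Counts) -> float:
    # People who gave up looking are added to both the count and the base.
    unemployed = c.jobless_looking + c.jobless_gave_up
    return unemployed / (employed(c) + unemployed)


def u6_like(c: Counts) -> float:
    # Involuntary part-timers are also counted as underused labor.
    jobless = c.jobless_looking + c.jobless_gave_up
    return (jobless + c.part_time_wants_full_time) / (employed(c) + jobless)


counts = Counts(full_time=120_000, part_time_by_choice=20_000,
                part_time_wants_full_time=6_000,
                jobless_looking=8_000, jobless_gave_up=1_000)

for name, rate in [("U3-like", u3_like(counts)),
                   ("U4-like", u4_like(counts)),
                   ("U6-like", u6_like(counts))]:
    print(f"{name}: {rate:.1%}")
```

The people never change between the three calculations; only the rule for who counts does, and the resulting rate moves with it.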

And this says nothing about how these statistics are actually tabulated. No one goes around asking every single American about their employment status every single month. The official numbers are not “raw” counts but must be derived from other data in a vast and sophisticated ongoing estimation process based on random sampling. Unemployment figures, being estimates, have statistical estimation error, far more error than generally realized. This makes most stories about short-term increases or decreases irrelevant.[3]
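A rough sketch can show why an estimate built from a random sample wobbles from month to month even when nothing in the world changes. The numbers here are assumptions chosen for illustration: a true rate fixed at 5 percent and a simple random sample of 10,000 people each month. The real survey design is far more elaborate, so this shows only the principle, not the official method.

```python
import math
import random


def standard_error(rate: float, sample_size: int) -> float:
    # Standard error of a proportion under simple random sampling.
    return math.sqrt(rate * (1 - rate) / sample_size)


def simulate_estimates(true_rate: float, sample_size: int,
                       months: int, seed: int = 0) -> list:
    # Draw a fresh random sample each month and record the observed rate.
    rng = random.Random(seed)
    estimates = []
    for _ in range(months):
        unemployed = sum(rng.random() < true_rate for _ in range(sample_size))
        estimates.append(unemployed / sample_size)
    return estimates


TRUE_RATE = 0.05       # assumed, and held constant for the whole year
SAMPLE_SIZE = 10_000   # assumed, not the real survey size

se = standard_error(TRUE_RATE, SAMPLE_SIZE)
margin = 1.96 * se     # approximate 95 percent margin of error
estimates = simulate_estimates(TRUE_RATE, SAMPLE_SIZE, months=12)

print(f"margin of error: +/- {margin:.2%}")
for month, estimate in enumerate(estimates, start=1):
    print(f"month {month:2d}: {estimate:.2%}")
```

Under these assumptions the monthly figures drift up and down although the underlying rate never moves, which is why a small month-to-month change may be nothing more than sampling noise.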

There is a complex relationship between the idea conveyed by the words “unemployment rate” and the process that produces a particular set of numbers. Normally all of this is backstage, hidden behind the chart. It’s the same for any other data. Data is created. It is a record, a document, an artifact, dripping with meaning and circumstance. A machine recorded a number at some point on some medium, or a particular human on a particular day made a judgment that some aspect of the world was this and not that, and marked a 0 or a 1. Even before that, someone had to decide that some sort of information was worth recording, had to conceive of the categories and meanings and ways of measurement, and had to set up the whole apparatus of data production.[ii]

Data production is an elaborate process involving humans, machines, ideas, and reality. It is social, physical, and specific to time and place. I’m going to call this whole process “quantification,” a word which I’ll use to include everything from dreaming up what should be counted to wiring up sensors.

If quantification turns the world into data, analysis tells us what the data means. Here is where journalism leans most heavily on traditional mathematical statistics. If you’ve found statistics difficult to learn, it’s not your fault. It has been terribly taught.[4] Yet the underlying ideas are beautiful and sensible. These foundational principles lead to certain rules that guide our search for truth, and we want those rules. It is hard to forgive arithmetic errors or a reporter’s confused causality. Journalism can demand deep and specific technical knowledge. It’s no place for people who want to avoid math.

Suppose you want to know if the unemployment rate is affected by, say, tax policy. You might compare the unemployment rates of countries with different tax rates. The logic here is sound, but a simple comparison is wrong. A great many things can and do affect the unemployment rate, so it’s difficult to isolate just the effect of taxes. Even so, you can build statistical models to help you guess what the unemployment rate would have been if all factors other than tax policy were the same between countries. We’re now talking about imaginary worlds, derived from the real through force of logic. That’s a tricky thing: not always possible, and not always defensible even when formally possible. But we do have hundreds of years of guidance to help us.

Journalists are not economists, of course. They’re not really specialists of any kind, especially if journalism is all they have studied and practiced. We already have economists, epidemiologists, criminologists, climatologists, and on and on. But journalists need to understand the methods of any field they touch, or they will be unable to tell good work from bad. They won’t know which analyses are worth repeating. Even worse, they will not understand which data matters. And, increasingly, journalists are attempting their own analyses when they discover that the knowledge they want does not yet exist. Journalists aren’t scientists, but they need to understand what science knows about evidence and inference.

There are few outright equations in this book, but it is a technical book. I use standard statistical language and try to describe concepts faithfully but mostly skip the formal details. Whenever you see a word in italics that means you can go look it up elsewhere. Each technical term is a gateway to whole worlds of specialized knowledge. I hope this book gives you a high-level view of how statistical theory is put together, so you’ll know what you’re trying to do and where you might look for the appropriate pieces.

After analysis comes communication. This makes journalism different from scholarship or science, or any field that produces knowledge but doesn’t feel the compulsion to tell the public about it in an understandable way. Journalism is for the audience, which is often a very broad audience, potentially millions of people.

Communication depends on human culture and cognition. A story includes an unemployment chart because it’s a better way of communicating changes in the unemployment rate than a table of numbers, which is true because human eyes and brains process visual information in a certain way. Your visual system is attuned to the orientation of lines, which allows you to perceive trends without conscious effort. This is a remarkable fact which makes data visualization possible! And it shows that data journalists need to understand quantitative cognition if they want to communicate effectively.

From experience and experiments we know quite a lot about how minds work with data. Raw numbers are difficult to interpret without comparisons, which leads to all sorts of normalization formulas. Variation tends to get collapsed into stereotypes, and uncertainty tends to be ignored as we look for patterns and simplifications. Risk is personal and subjective, but there are sensible ways to compare and communicate odds.

But more than these technical concerns is the question of what is being said about whom. Journalism is supposed to reflect society back to itself, but who is the “we” in the data? Certain people are excluded from any count, and astonishing variation is abstracted into uniformity. The unemployment rate reduces each voice to a single bit: are you looking for work, yes/no? A vast social media data set seems like it ought to tell us deep truths about society, but it cannot say anything about the people who don’t post, or the things they don’t post about. Omniscience sounds fantastic, but data is a map and not the territory.

And then there’s the audience. What someone understands when they look at the data depends on what they already believe. If you aren’t unemployed yourself, you have to rely on some image of “unemployed person” to bring meaning to the idea of an unemployment rate. That image may be positive or negative, it may be justified or untrue, but you have to fill in the idea of unemployment with something to make any sense at all of unemployment statistics. Data can demolish or reinforce stereotypes, so it’s important for the journalist to be aware that these stereotypes are in play. That is one reason why it’s not enough for data to be presented “accurately.” We have to ask what the recipient will end up believing about the world, and about the people represented by the data. Often, data is best communicated by connecting it to stories from the individual lives it represents.

We’re not quite done. I want action. Someone eventually has to act on what they’ve learned if journalism is going to mean anything at all, and action is a powerfully clarifying perspective. Knowing the unemployment rate is interesting. Much better is knowing that a specific plan would plausibly create jobs. This sort of deep research will usually be done by specialists, but journalists have to understand enough to act as a communicator and an independent check. As a media professional, a journalist has both the power and responsibility to decide what is worth repeating.

Data cannot tell us what to do, but it can sometimes tell us about consequences. The twentieth century saw great advances in our understanding of causality and prediction. But prediction is very hard. Most things can’t be predicted well, for fundamental reasons such as lack of data, intrinsic randomness, free will, or the butterfly effect. These are profound limits to what we can know about the future. Yet where prediction is possible, there is convincing evidence that data is essential. Purely qualitative methods, no matter how sophisticated, just don’t seem to be as accurate. Statistical methods are essential for journalism that asks what will happen, what should be done, or how best to do it.

This doesn’t mean we can just run the equations forward and read off what to do. We’ve seen that dream before. At an individual level, the ancient desire for universal quantification can be a source of mathematical inspiration. Leibniz dreamed of an unambiguous language of “universal character.” Three centuries later, the failure of the symbolic logic paradigm in artificial intelligence finally showed how impractical that is, but the exercise was enormously productive. The desire for universal quantification hasn’t worked out quite so well at a societal level. Every authoritarian planner dreams of utopia, but totalitarian technocratic visions have been uniformly disastrous for the people living in them. A fully quantified social order is an insult to freedom, and there are good reasons to suspect such systems will always be defeated by their rigidity.[5] Questions of action can hone and refine data work, but actual action (making a choice and doing) requires practical knowledge, wisdom, and creativity. The use of statistics in journalism, like the use of statistics in general, will always involve artistry.

All of this is implicit in every use of data in journalism. All of it is just below the surface of an unemployment chart in the news, to say nothing of the dazzling visualizations that journalists now create. Journalism depends on what we have decided to count, the techniques used to interpret those counts, how we decide to show the results, and what happens after we do. And then the world changes, and we report again.


Quantification

The mathematical modeling tools we employ at once extend and limit our ability to conceive the world. - David Hestenes[6]

There were no Hispanics living in the United States before 1970. At least, there weren’t according to the census. There couldn’t be, because the census form did not include “Hispanic” or “Latino” or anything like it.[iii]

Actually there were about nine million Hispanics living in the country by 1970.[7] In many ways the lack of census data made them invisible. You couldn’t say with certainty where they were living. It would have been difficult to know how the health, education, and income of Hispanic families compared to other families, much less contemplate ways to close the gaps. You wouldn’t even know how many people might be affected if you did.

Quantification is the process that creates data. You can only measure what you can conceive. That’s the first challenge of quantification. The next challenge is actually measuring it, and knowing that you measured it accurately. Data is only useful because it represents the world, but that link can be fragile. At some point, some person or machine counted or measured or categorized, and recorded the result. The whole process has to work just right, and our understanding of exactly how it all works has to be correct, or the data won’t be meaningful.

Sometimes this is not a simple thing to do. It seems clear enough how to quantify the number of cars sold or the amount of grain exported, where counting has the feel of something objective and definite. But journalists are interested in many other things where the proper relationship between the words, the numbers, and the world is much less clear.

Are mass shootings more or less common today than 10 years ago? What fraction of the population is Hispanic? How many people suffer from depression? These seem like questions that counting can answer, but “mass shootings,” “Hispanics,” and “depression” are not easy things to count. Who, precisely, counts as depressed? And how would you determine the number of depressed people in the entire country?

Quantification is a problem without a home. Statisticians and computer scientists do not normally spend a lot of time asking how data came to be. Actually, their methods are powerful precisely because they are abstract. Physicists and engineers were the first to think seriously about quantification, and they have carefully developed the processes of measurement over many centuries. Even in such “hard” disciplines there are many choices that must be made about what gets measured, but these fields usually only deal with quantities that can be expressed in the units of physics. Econometrics broadened the horizons, but it is psychologists and social scientists who have thought most deeply about the quantification of people and societies, the sorts of quantifications that are often most interesting and most vexing to a journalist.[iv]

I’m going to try to give the flavor of the problems of quantification with two examples: recording someone’s race in a database and estimating the monthly unemployment rate. The first is a parable about the difficulty of categories. The second is a tour through the beautiful ideas of random sampling and quantified uncertainty so central to modern statistical work. But before we can get there, we have to talk about what makes something “quantitative” at all.


The Quantities of Everyday Language

Quantity is an ancient idea, so ancient that it appears at the core of every human language. Words like “less” and “every” are obviously quantitative, and lead to more complex concepts like “trend” and “significant.” Quantitative thinking starts with recognizing when you are talking about quantities.

Spot the quantitative ideas in this sentence from the article “Anti-Intellectualism is Killing America,” which appeared in Psychology Today:

In a country where a sitting congressman told a crowd that evolution and the Big Bang are “lies straight from the pit of hell,” where the chairman of a Senate environmental panel brought a snowball into the chamber as evidence that climate change is a hoax, where almost one in three citizens can’t name the vice president, it is beyond dispute that critical thinking has been abandoned as a cultural value.[8]

This is pure cultural critique, and we could take it many different ways. We could read this sentence as a rant, a plea, an affirmation, a provocation, a list of examples, or any other type of expression. Maybe it’s art. But journalism is traditionally understood as “nonfiction,” so let’s take this at face value and ask whether it’s true.

I see an empirical and quantitative claim at the heart of the phrase “critical thinking has been abandoned as a cultural value.” It’s empirical because it speaks about something that is happening in the world, something with observable consequences. It’s quantitative because the word “abandoned” speaks about comparing the amount of something at two different times. Something we never had can’t be abandoned.

For at least two points in time we need to decide whether or not “critical thinking is a cultural value.” This is the moment of quantification. “Abandoned” might have an all-or-nothing flavor, but it’s probably a lot more reasonable to define shades of gray based on the number of people and institutions that are embodying the value of critical thinking; or perhaps it makes sense to look at how many acts of critical thinking are occurring. Of course “critical thinking” is not an easy thing to pin down, but if we choose any definition at all we are literally deciding which things “count” as critical thinking. The next step is to come up with a practical plan to count those things. If we can’t or won’t count in practice, there’s no quantitative way to test this claim against reality. It’s not that the sentence would then mean nothing, it’s just that its meaning couldn’t be evaluated by comparing the words with the world in a yes/no kind of way.


One way or another, testing the claim that “critical thinking has been abandoned as a cultural value” demands that we count something at two different points in time and look for a drop in the number. There are surely fights waiting to happen over what should be counted, whether it was correctly counted, and the numerical threshold for “abandoned.” But if you’re willing to make some choices, you can go out and find relevant facts. This is what the author’s given us:

a sitting congressman told a crowd that evolution and the Big Bang are “lies straight from the pit of hell”

the chairman of a Senate environmental panel brought a snowball into the chamber as evidence that climate change is a hoax

almost one in three citizens can’t name the vice president

Even if these were all good examples of a failure of “critical thinking,” they still wouldn’t be good evidence for the idea that critical thinking has been abandoned. The problem is that the author is trying to say something about a very large group of people. These examples would need to be representative. Are these failures of critical thinking typical of the whole society? It seems just as easy to come up with counterexamples. Yeah, someone brought a snowball into Congress to argue against climate change, but the EPA also recently decided to start regulating carbon dioxide as a pollutant. That’s evidence against the representativeness of the author’s examples, but of course you could dig up a million more examples on each side. That’s where counting gets interesting: it’s a systematic way to grasp the whole of something, which can lead to much stronger statements.

That’s the logic behind historian G. Kitson Clark’s advice for making generalizations:

Do not guess; try to count. And if you cannot count, admit that you are guessing.[9]

The fact that “one in three citizens can’t name the vice president” is closer to the sort of evidence we need. This statement generalizes in a way that individual examples can’t, because it makes a claim about all U.S. citizens. It doesn’t matter how many people I can name who know who the vice president is, because we know (by counting) that there are 100 million who cannot. But this still only addresses one point in time. Were things better before? Was there any point in history where more than two-thirds of the population could name the vice president? We don’t know.

In short, the evidence in this sentence is not the right type. The word “abandoned” has embedded quantitative concepts that are not being properly handled. We need something tested or measured or counted across the entire culture at two different points in time, and we don’t have that. None of which makes this a “bad” piece of writing. It might provoke the reader to think about the value of critical thinking. It might be emotionally resonant. It might draw attention to important examples. It might even be persuasive. Whether it’s good or not depends on what you want it to do. But in terms of empirical claims and the evidence provided for them, this is a weak argument. It doesn’t respect the quantitative structure of the language it uses.

Many words have quantitative aspects. Words like “all,” “every,” “none,” and “some” are so explicitly quantitative that they’re called quantifiers in mathematics. Comparisons like “more” and “fewer” are clearly about counting, but much richer words like “better” and “worse” also imply counting or measuring at least two things. There are words that compare different points in time, like “trend,” “progress,” and “abandoned.” There are words that imply magnitudes such as “few,” “gargantuan,” and “scant.” A series of Greek philosophers, long before Christ, showed that the meanings of “if,” “then,” “and,” “or,” and “not” could be captured symbolically as propositional logic. To be sure, all of these words have meanings and resonances far beyond the mathematical. But they lose their central meaning if the quantitative core is ignored.

We’re really taking language apart here, and no one could make it through a day if they had to fact check every sentence they read. Also, there are other ways of relating to a story. But this is a way of seeing that every journalist should have in their toolbox, and pass on to readers when helpful. The relation between words and numbers is of fundamental importance to the pursuit of truth. It tells you when you should be counting something.


Counting Race

In 2004, the government of Florida drew up a list of felons who were ineligible to vote. It did this by matching names between a criminal records database and a registered voter database. The courts ordered that the list be released publicly, and shortly thereafter the Sarasota Herald-Tribune discovered that there were almost no Hispanics on the list.[10]

This seemed impossible. Hispanics made up 17 percent of the population but only one-tenth of 1 percent of the list; there were only 61 Hispanic people on the list of 47,763 names. At the time, Florida’s Hispanic voters were mostly Cubans who supported the Republican Party. If they weren’t on the list, they would be allowed to vote. There were accusations of politically motivated fraud.
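The arithmetic behind “one-tenth of 1 percent” is easy to check. The sketch below uses only the three figures quoted above. The proportional baseline is a deliberately naive assumption for scale; it is not a claim about how many Hispanic names the list should really have contained.

```python
list_size = 47_763        # names on the felon list
hispanic_on_list = 61     # names recorded as Hispanic
population_share = 0.17   # Hispanic share of the population

# Share of the list that was recorded as Hispanic.
observed_share = hispanic_on_list / list_size

# Naive baseline: how many names there would be if the list simply
# mirrored the population (an assumption, used here only for scale).
expected_if_proportional = population_share * list_size

print(f"observed share of list: {observed_share:.2%}")
print(f"names if proportional:  {expected_if_proportional:,.0f}")
```

The gap between 61 names and a baseline in the thousands is what made the list look so suspicious.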

More digging revealed that this was not actually a political maneuver but a data problem. In the state’s voter database, Hispanic is a “race.” In the criminal history database, Hispanic is an “ethnicity.” The same information was conceived in two different ways, so it was recorded in two different fields in two different systems. To prevent false matches based on name alone, the government had chosen to match on name, date of birth, and “race” but not “ethnicity.” Thus, Hispanic felons could never match Hispanic voters.[11]
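A toy version of the matching step shows how this happens. The records, names, and field layout below are entirely hypothetical; the sketch assumes only what the text describes, namely an exact match on name, date of birth, and race between two systems that store “Hispanic” in different fields.

```python
# Hypothetical criminal-history records: Hispanic is stored as an ethnicity.
felons = [
    {"name": "Maria Lopez", "dob": "1970-03-14",
     "race": "White", "ethnicity": "Hispanic"},
    {"name": "John Smith", "dob": "1965-11-02",
     "race": "White", "ethnicity": "Not Hispanic"},
]

# Hypothetical voter records: Hispanic is stored as a race.
voters = [
    {"name": "Maria Lopez", "dob": "1970-03-14", "race": "Hispanic"},
    {"name": "John Smith", "dob": "1965-11-02", "race": "White"},
]


def match(felons, voters, keys):
    """Return names of felons whose values for `keys` equal some voter's."""
    voter_index = {tuple(v[k] for k in keys) for v in voters}
    return [f["name"] for f in felons
            if tuple(f[k] for k in keys) in voter_index]


with_race = match(felons, voters, ("name", "dob", "race"))
without_race = match(felons, voters, ("name", "dob"))

print("matched on name, dob, race:", with_race)
print("matched on name, dob only: ", without_race)
```

With race included in the key, the hypothetical Hispanic record can never match, because one system says “White” where the other says “Hispanic.” Dropping race from the key restores the match, at the cost of the false matches the race field was meant to prevent.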

Which database schema is correct? Is Hispanic an ethnicity or a race? This sounds like a cultural, social, or even philosophical question, but in this context it’s really a question about the process of counting. After all, these databases are concrete objects, created by humans. At some point there was a decision that each person was, or was not, Hispanic, and this value was recorded in either the “race” or “ethnicity” column.

How do you assign a racial category to each person, or even decide what those categories should be? This is a problem that the U.S. Census has solved, for better or worse, for over 200 years.

Article I, Section 2 of the 1787 Constitution established the census and divided people into three categories: “free persons”; “Indians not taxed”; and “other persons,” which really meant “slaves.” Although aligned with race, these were also political categories because the census was created to apportion representatives and taxes between the states. Indians counted for neither representation nor taxes, while slaves were only counted as three-fifths of a person. This was the compromise between the slave and non-slave states that created the country. It seems insane now, but that’s the history, and a reminder that the census is not an “objective” count but a bureaucratic process that generates data for specific purposes. Asking why the data was collected does not answer how it was collected, but it’s often a big hint.


Over the next century it became possible for a person to be counted in many more different ways. The category of “free colored person” appeared in 1820. No one was interracial, according to the data, until the 1850 census added the category of “mulatto.” The 1890 census expanded into ethnicity and finer shades of black when it asked “whether white, black, mulatto, quadroon, octoroon, Chinese, Japanese, or Indian.”

Of course you could see people of all these types on city streets by then, but not in the official statistics until these additions. Categories were being added to better describe a reality that could already be perceived by other means. Which doesn’t make the categories reality. There were huge numbers of people who didn’t fit into any of these categories, like the Irish, who suffered intense racism in nineteenth-century America.

But a list of races doesn’t tell us how a person’s race was actually determined. In practice, a census enumerator visited each home and checked a box. For decades, enumerators were told to count someone as black if there was any degree of black ancestry, echoing the “one drop rule” of the Jim Crow era. Here’s how race was supposed to be quantified for the 1940 census:


Instructions for quantifying race and sex on the 1940 census.[12]

It’s not clear how census-takers were supposed to determine someone’s ancestry going back generations, or how they applied this rule in practice, or if they even read the instructions, meaning that we don’t know quite how to interpret the racial categories of the early twentieth-century census. If the collection method is obscure, so is the data.

Then things changed. In the mid-twentieth century there was a huge shift in the way race was counted, but not because of social or philosophical ideals. Instead the motive was statistical accuracy.

Close analysis of the 1940 census data suggested that the results were low by 3.6 percent, meaning millions of people had not been counted. The census was supposed to be a simple count, but the massive undercount proved that counting was anything but simple. And some people were more undercounted than others: 13 percent of non-“white” people were missing from the census results.
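A back-of-the-envelope calculation gives a sense of what “millions” means here. Two things in this sketch are assumptions not taken from the text: the published 1940 total of roughly 132 million, and the reading of “low by 3.6 percent” as 3.6 percent of the true population going uncounted.

```python
counted_1940 = 132_164_569   # published 1940 census total (assumed, not from the text)
undercount_rate = 0.036      # share of the true population that was missed

# If the count captured (1 - rate) of the true population,
# the true population is the count divided by that fraction.
estimated_true = counted_1940 / (1 - undercount_rate)
missed = estimated_true - counted_1940

print(f"estimated people missed: {missed / 1e6:.1f} million")
```

If “low by 3.6 percent” is instead read as 3.6 percent of the published count, the figure comes out slightly smaller, but either way it is several million people.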

There was clearly a racial bias in the census-taking process. It was soon discovered that census enumerators were having difficulty identifying American Indians in urban areas where they were mixed in with majority white populations. This proved that looking at someone didn’t always provide an accurate impression of their race. To address this, the 1960 census used a different approach: People were simply asked what race they were.

If self-identification seems the obvious way to determine race, that’s because we now understand race as an entanglement of identity, culture, and biology, as much social as genetic. But that is a late twentieth-century understanding. The census officials of the 1950s do not seem to have understood race this way; they simply wanted a more accurate count and took for granted that a person knows their own race.

There is something about self-identification that feels like a step forward in codifying race, a better way of making it visible in the aggregate. It’s a more dignified approach. But it has its own serious limitations. It’s not the data you need if you want to study race-linked genetic diseases or how people treat strangers differently based on skin color. We can think of race in many different ways, but the available data has no obligation to match our conceptions. If you want to know what the data really measures, the only thing that matters is how it was collected. Hence, the census up to 1950 counts something different than the census from 1960 onward, even though both call it “race.” How is it different? That depends on the question you wish to ask of the data.

Meanwhile, Hispanics had begun to make up a significant fraction of the U.S. population, and “Hispanic” finally appeared on census forms in 1970. Before that the census said nothing about how many Hispanic people lived in the country, where they lived, their incomes, or any of the other variables now routinely collected.

Things changed again in 1977 with a new set of federal government guidelines on the collection of race data, the infamous “Directive 15” from the Office of Management and Budget. This recommended dividing race into four categories: “American Indian or Alaska Native,” “Asian or Pacific Islander,” “Black,” and “White.” It also said “it is preferable to collect data on race and ethnicity separately” and defined ethnicity as “Hispanic origin” or “not of Hispanic origin.” The logic here is that Hispanics can be any race, such as Afro-Cubans. Which is great, except that about a third of all Hispanic people consider “Hispanic” to be a race, or at least they check “other race” on their census forms and write in “Hispanic” or “Mexican” or “Latina.”[13]

This is how Florida’s criminal history database came to code Hispanics differently than Florida’s voter registration database. The database of felons coded race according to federal standards, so race could only be white, black, Asian, American Indian, or unknown. Hispanic was coded as an ethnicity, in a different field. Meanwhile, the voter registration database coded Hispanic as a race. A simple comparison on the “race” field failed, because race is not a simple thing to quantify.

If the federal racial categorization system feels a bit arbitrary, that’s because it is. Even its creators knew not to take it too seriously, writing, “These classifications should not be interpreted as being scientific or anthropological in nature.”[14] Nonetheless, all of the federal government’s race data includes these four master categories to this day. But many agencies also collect more detailed information on racial sub-categories. The census has long included a growing list of Asian races, and you’ve been able to write in any race you want since 1910.

The last major change to the race questions on the census came in 2000. Now you’re allowed to check multiple races on the census form, in addition to several possible choices for Hispanic ethnicity. The 2010 form looked like this:

On the 2010 census, 2.9 percent of the population identified as two or more races. This is nine million people who are expressing a type of racial identity which was invisible before we decided to count it.

The Problem of What to Count

Quantification always involves complex choices, even in the hard sciences. Although friction is a basic force of classical physics, it comes from micro-interactions between surfaces that aren’t fully understood. A high school physics textbook will tell you that we usually describe it with two numbers: the coefficient of static friction, which determines how hard you have to push to start sliding, and the coefficient of kinetic friction, which determines how hard you have to push to keep sliding. But more sophisticated measurements show that friction is actually quite a complex force. It also depends on velocity, and even on how fast you were sliding previously.[15]

Anyone working with friction has to choose how to quantify it.

Race is even more difficult to quantify, as are a great many things of social interest. It’s terribly easy to forget this complexity when you are looking at neat rows and columns of data.

A few years ago I worked on a story about gun violence. At the time there was a lot of popular discussion about “mass shooting” incidents, and whether they were or weren’t on the rise. But what’s a “mass shooting”? It seems like a single murder doesn’t count, so how many people must be killed at once before it’s “mass”? You have to answer this question before you can answer the question of whether such incidents are more or less common than before. I eventually chose four people as the minimum threshold for a mass shooting, because that’s what the data I had used. The creators of that data chose four because this is how the FBI counts “mass murders,” even though those aren’t quite the same thing as “mass shootings.” Responding to the interest in these events, the FBI later released its own data set of “active shooter” incidents, which it defined as “individuals actively engaged in killing or attempting to kill people in populated areas (excluding shootings related to gang or drug violence).”[v]

This is all somewhat arbitrary, and there is no “right” answer here. What you should count depends on what you care about, that is, it depends on the story you are attempting to tell. And after looking at the data you may realize that you want to count something else. Your initial story may turn out to be uninteresting, unfair, or just plain wrong.

It gets even trickier. Imagine tracking the prevalence of mental health issues such as “depression” or “borderline personality disorder,” which are short names for evolving ideas about diseases. The complex diagnostic criteria for these conditions, which used to be printed in thick handbooks, define a quantification process. Or think of the police officer who must record if a particular incident is “sexual harassment” or not. It’s easy to imagine that not every officer will have the same idea of what sexual harassment means. This can make the data maddeningly hard to interpret, not to mention unfair. Small differences in counting technique can and do become the focus of intense arguments.

Still we find some way to count. A quantification process formalizes the act of counting or measuring or categorizing and attempts to apply it consistently across many situations. That’s the whole point of standard units like meters and kilograms. But alas, many vital things do not have standard measures. How do we quantify more abstract concepts such as “educational attainment” or “quality of life” or “intelligence”?

In practice we end up replacing such rich concepts with much simpler proxies. We get “test scores” instead of “educational attainment” and “income” as a proxy for “quality of life,” while “intelligence” is today measured by a battery of tests which assess many different cognitive skills. In experimental science this is called operationalizing a variable, a fancy name for picking a definition that’s both analytically useful and practical enough to create data.

If you want to ask a question that only quantitative methods can answer, you have little choice but to make this switch from rich conception to repeatable measurement. But quantification can also force you to be clear. Trying to quantify might lead you to discover that you’ve been using certain words for a long time without really understanding what they mean. Do you really know what “intelligence” is? Eventually a quantification of a thing can become the definition, as the IQ test did. This might be a clarifying improvement, or a narrowing of perception, or both. In any case, it is a choice that should be made consciously.

Usually there is some end goal, some purpose to collecting data, and you can ask whether any particular quantification method serves that purpose. And you can ask about the end purpose, too, the frame of the entire thing. Different quantification methods serve different stories.

Sampling and Quantified Error

You should be skeptical of any headline that says the number of jobs in the United States has changed by fewer than about 105,000 since last month. That’s because the monthly jobs growth estimate has a margin of error of about plus or minus 105,000.[16]

The New York Times made this point with an interactive graphic, showing how the uncertainty in employment figures can badly mislead us.

From The New York Times, 2014.[17]

Here, job growth was consistent at 150,000 new jobs each month, but the released figures show an upward trend just by chance. The monthly jobs figure calculated by the Bureau of Labor Statistics includes a fair amount of error due to random sampling, up to 105,000 jobs above or below the real value. Pressing “play” animates the right-hand chart through endless possible scenarios with the same range of error. If you wait for a minute you can see cases where job growth appears to have any trend you like. Because of these random errors, monthly changes typically mean less than we think they do. Long-term trends are much more reliable.
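To get a feel for how sampling noise alone can manufacture apparent trends, here is a minimal Python sketch of the same idea. It is not how the Times built its graphic. It assumes that the plus or minus 105,000 margin corresponds to a 90 percent confidence level and that the error is roughly normal; the names `TRUE_GROWTH`, `SIGMA`, `simulate_year`, and `share_outside_margin` are invented for this illustration.

```python
import random
from statistics import NormalDist

TRUE_GROWTH = 150_000   # constant true monthly job growth, as in the graphic
MARGIN = 105_000        # reported margin of error
CONFIDENCE = 0.90       # ASSUMPTION: confidence level behind that margin

# ASSUMPTION: normally distributed error, so the margin converts to a
# standard deviation by dividing by the matching z-score (about 1.645).
z = NormalDist().inv_cdf(0.5 + CONFIDENCE / 2)
SIGMA = MARGIN / z

def simulate_year(rng, months=12):
    """Reported monthly figures when the true growth never changes."""
    return [round(rng.gauss(TRUE_GROWTH, SIGMA)) for _ in range(months)]

def share_outside_margin(rng, trials=100_000):
    """Fraction of simulated reports that miss the truth by more than the margin."""
    misses = sum(
        abs(rng.gauss(TRUE_GROWTH, SIGMA) - TRUE_GROWTH) > MARGIN
        for _ in range(trials)
    )
    return misses / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
for month, reported in enumerate(simulate_year(rng), start=1):
    print(f"month {month:2d}: reported {reported:>8,d}   true {TRUE_GROWTH:,d}")
print(f"share of reports off by more than the margin: {share_outside_margin(rng):.3f}")
```

Changing the seed plays the role of pressing “play”: each run is another equally plausible year of reports drawn around the same flat truth.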

Political polls also have built-in error. If one candidate is ahead of the other 47 percent to 45 percent, but the margin of error is 5 percent, there is a pretty good chance that another identical poll will show the candidates the other way around. Pretty much any sort of public survey will have intrinsic error, and a reputable source will report the margin of error along with the results. The error of a measurement is a necessary part of understanding what that measurement means.

Maybe you’ve seen formulas for calculating the margin of error for a random sample, but rather than repeat those equations I want to give a sense of why we use random sampling at all and how it leads to quantified error. Expressing how much error there is may seem obvious now, but it was a key innovation in the history of statistics. There is a random sample in the Old Testament: “The people cast lots to bring one out of every ten of them to live in Jerusalem.”[vi] It couldn’t have been long before someone thought of counting by letting each of the chosen stand for 10, but millennia passed before anyone was able to estimate the accuracy of this process.

Sampling is basically a labor-saving device. The unemployment figures need to come out every month, but nobody is going to knock on your door 12 times a year to ask if you have a job. Instead the unemployment rate is calculated from the answers to two surveys: the Current Establishment Survey which samples businesses, and the Current Population Survey which samples households.[vii] The household survey covers 150,000 randomly chosen people each month,[viii] each eventually assigned to one of three categories: “employed,” “unemployed,” or “not in the labor force.”[18] The fraction of “unemployed” people among those asked then stands in for the fraction of unemployed people in the whole country.

If this doesn’t strike you as audacious, you’ve probably never thought about just what a poll claims to be able to do. Extrapolating from 150,000 people to 300,000,000 people means collecting information from one person in 2,000, then saying it speaks for the other 1,999. It’s like asking only one person in each neighborhood whether he or she is employed.

Randomness is the key to this, because it makes over-representation by any one group extremely unlikely. It’s possible that all the people who answer a random telephone poll might be unemployed just by chance, giving us a bad estimate. But that will happen rarely (essentially never in practice), and how else should we pick people? We could count through consecutive phone numbers instead, but that might only get us answers from a certain area. Or we could just go through our own contact lists, but that seems even less representative. Randomness is not subject to selection bias precisely because it has no relation to anything else. Even better, although any given sample will give us an estimate that is off by some amount, the most common value is going to be the true value. Also, it’s randomness that allows us to reason about what the error is. Instead of reasoning about the error of a single survey, which is unknowable, we can reason about the error of the sampling process across many different surveys. This is akin to saying that we can’t know what the next roll of the die will be, but there is a one-sixth chance it will be a five.

Let’s make the problem a little simpler and imagine that there are only 50 people in the whole country, and you’ve computed the unemployment rate by sampling five of them. You could have ended up with many different sets of five people in your sample had chance taken a different course, but there are a finite number of possibilities. Here are some of them, and the different unemployment rate estimates that each one would give you:

You can imagine drawing a picture of every possible set of five names out of 50. You’ll end up with “50 choose 5” different sampling patterns, a number which is usually written $\binom{50}{5}$. You can get an actual number for this using the “choose” or “combinations” function of a scientific calculator or programming language, and it’s 2,118,760, over two million. There are an awful lot of ways to pick five random things out of 50 possible things, and a hugely larger number of ways to pick 150,000 people out of 300,000,000, but we can count with simple formulas either way.

We can group all of these sampling patterns into six piles, according to how many people in each sample turned up unemployed, zero to five. This groups our answers into unemployment rates of 0/5, 1/5, 2/5, 3/5, 4/5, and 5/5, which is the same as 0%, 20%, 40%, 60%, 80%, and 100% unemployment. Because each possible sample (each set of five names) is equally likely, the size of each pile tells you your chances of getting a final estimate with that number of unemployed people. This is the key insight that will allow us to quantify how often we expect our unemployment estimate to be wrong, and by how much.

You don’t actually need stacks of drawings to calculate the error of an unemployment estimate, because we can directly calculate the number of samples of each kind. For example, we can work out how many samples include exactly one unemployed person. Here there are 50 people, 20 of whom are unemployed. The total number of ways to choose five people from 50 so that exactly one turns up unemployed is equal to the number of ways to pick one unemployed person from 20, times the number of ways to pick four employed people out of the remaining 30.

This is written $\binom{20}{1}\binom{30}{4}$ using the standard notation for “choose.” Some readers will recognize a similar[ix] term in the binomial distribution function B(50,0.4), the formula developed by Bernoulli sometime in the 1680s.

This formula makes it possible to tally the number of ways to get a sample with any particular number of unemployed people. Dividing the number of possible samples for each level of unemployment by the total of 2,118,760 possible samples gives the probability of seeing each possible unemployment estimate.

Estimated Unemployment    No. Samples    Probability of Getting This Answer
0%                        142,506        0.07
20%                       548,100        0.26
40%                       771,400        0.36
60%                       495,900        0.23
80%                       145,350        0.07
100%                      15,504         0.01
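The counts and probabilities above can be reproduced in a few lines. This is a sketch that uses Python’s built-in `math.comb` as the “choose” function; the function name `sampling_distribution` and the constants are illustrative choices, not part of the original analysis.

```python
from math import comb

POPULATION = 50   # people in the toy country
UNEMPLOYED = 20   # how many of them are unemployed (a true rate of 40%)
SAMPLE = 5        # how many people we ask

def sampling_distribution(population, unemployed, sample):
    """For each possible count k of unemployed people in the sample, return
    (estimated rate, number of samples with that count, probability)."""
    total = comb(population, sample)          # every possible sample
    employed = population - unemployed
    rows = []
    for k in range(sample + 1):
        # choose k of the unemployed, and the rest of the sample from the employed
        ways = comb(unemployed, k) * comb(employed, sample - k)
        rows.append((k / sample, ways, ways / total))
    return rows

for rate, ways, prob in sampling_distribution(POPULATION, UNEMPLOYED, SAMPLE):
    print(f"{rate:>4.0%}  {ways:>9,d}  {prob:.2f}")
```

Because sampling here is done without replacement from a small population, this is the hypergeometric distribution; calling the function with a different sample size shows how quickly the piles concentrate around the true rate.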

To make this easier to see we can plot the figures like so:

This chart shows a sampling distribution, meaning that we would expect to see each answer in these proportions if we repeated the random sampling process many times. As we had hoped, answers closer to the truth occur more often than those further away, and the most common answer is the correct one. There’s a probability of 0.36, or a 36 percent chance, that we’ll end up with exactly the right answer from our little survey.

This distribution tells us everything we can know about the possible error in our sample value. But we’ll often want a more understandable summary, and one way of summarizing an error distribution is to say how often we’ll get within a certain distance of the correct answer. Let’s say we want to know how often we can expect to get either the true answer of 40%, or the closest incorrect answers of 20% and 60%. This requires adding up the probabilities that we get 20%, 40%, or 60%, which corresponds to seeing one, two, or three unemployed people in our sample. There’s a probability of 0.26 + 0.36 + 0.23 = 0.85 that we’ll see any of these three answers.

Among the 2,118,760 different samples of five that we could draw from our population of 50 people, we find that 1,815,400 or 85 percent of them contain one, two, or three unemployed people. Put another way, 85 percent of all samples contain between 20% and 60% unemployed.[x] This range is known as an 85-percent confidence interval. Because this interval covers a 40% range, and our best estimate is right in the middle, we say that the estimate has a margin of error of 20%. The margin of error is always half of the width of the confidence interval.
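As a check on this arithmetic, the sketch below counts the samples that fall inside the interval directly. The function name `coverage` is made up for illustration. One caveat: the exact share works out to about 0.857, so the 85 percent quoted in the text reflects adding up the rounded probabilities.

```python
from math import comb

def coverage(population, unemployed, sample, low, high):
    """Count the samples containing between low and high unemployed people
    (inclusive) and return that count with its share of all samples."""
    total = comb(population, sample)
    employed = population - unemployed
    hits = sum(
        comb(unemployed, k) * comb(employed, sample - k)
        for k in range(low, high + 1)
    )
    return hits, hits / total

hits, level = coverage(50, 20, 5, low=1, high=3)

# The interval runs from 1/5 to 3/5; the margin of error is half its width.
margin = (3 / 5 - 1 / 5) / 2

print(f"samples inside the interval: {hits:,d}")
print(f"confidence level: {level:.3f}")
print(f"margin of error: {margin:.0%}")
```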

We need one more step. So far we’ve been talking about the possible samples we might get for a given true unemployment rate of 40%, and how often we’ll end up with each estimated number. In reality we never get to know the true unemployment rate! We only ever get one sample, and this gives us only a single error-prone estimate. Instead of “how often is the estimate within the margin of error of the true value,” the question we really need to ask is “how often will the true value be within the margin of error of the estimate?”

To do this, we start with the estimated unemployment rate, that is, the rate of unemployment in the actual sample we have. We assume that this is the true rate and construct a margin of error using the process above. If the estimate is within 20% of the true value, then it follows that the true value is within 20% of the estimate. This isn’t perfectly accurate, because the margin of error varies in width depending on the true value, so our estimated margin of error won’t be quite right if the estimate isn’t quite right. You can work out more precise formulas, but this simple method of substituting the estimate for the true value gives a close approximation for practical survey sizes, and it’s widely used in practice.

And that’s it. We’ve now calculated the margin of error on our unemployment estimate. There are many different ways of phrasing our result, which all mean the same thing.

The 85-percent confidence interval is 20% to 60%.

The answer is 40% with a margin of error of 20%, 17 times out of 20.

We are 85 percent certain that the true answer is between 20% and 60%.

The answer is 40% ± 20% at 85 percent confidence.

Notice that we always use two values to measure the uncertainty: a margin of error and the probability that the true answer falls within that margin of error.[xi] The range of error, in this case 20% to 60%, is called the 85-percent confidence interval. The 85 percent figure itself is called the confidence level. Whatever language we use, we have quantified the error in our survey in two values: a range of error and how often you’ll see something within that range.

If 40% ± 20% at an 85-percent confidence level is a precise enough answer, you’ve reduced your work by a factor of 10 by asking only five out of 50 people. If it’s not precise enough, you can sample more people. To compare the error distributions of different numbers of samples, it helps to hold the confidence level constant. The Bureau of Labor Statistics reports the margin of error on unemployment figures at the 90-percent level, so we will too. We’ll also do the calculations as if we’re sampling from a real country’s population, which is much larger than 50.

The accuracy gets better as you ask more people. As the number of samples gets larger (we’re up to 100 in the last picture above), the margin of error gets narrower for a particular confidence level, and the distribution of possible answers rapidly approaches the classic bell-shaped curve, the normal distribution. Even better, for large samples the error caused by sampling depends primarily on the sample size, not the population size. This means that estimating the opinions of a hundred million people takes barely more work than estimating the opinions of one million. By the time you survey 1,000 people, the margin of error is down to 3% at the 90-percent confidence level.
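A rough way to see both claims, that error shrinks with sample size and barely depends on population size, is the textbook normal approximation for a sampled proportion. The sketch below assumes a simple random sample and the worst-case proportion p = 0.5; real surveys use more elaborate designs, and the function name `margin_of_error` is only illustrative.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, confidence=0.90, p=0.5, population=None):
    """Approximate margin of error for a proportion from a simple random sample.

    ASSUMPTIONS: normal approximation; p=0.5 is the worst case, giving the
    widest margin for a given sample size n.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.645 at 90 percent
    moe = z * sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: only matters when the sample is a
        # large share of the whole population.
        moe *= sqrt((population - n) / (population - 1))
    return moe

for n in (100, 1_000, 10_000):
    small = margin_of_error(n, population=1_000_000)
    large = margin_of_error(n, population=100_000_000)
    print(f"n={n:>6,d}   one million: ±{small:.2%}   hundred million: ±{large:.2%}")
```

For 1,000 respondents the result is a little under 3 percentage points, and the two population columns are nearly identical, which is the point of the paragraph above.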

This is how we know the error in our monthly unemployment estimates. The Current Population Survey samples 150,000 people out of 300,000,000. The Bureau of Labor Statistics has run the math and worked out that it’ll get within 300,000 of the true number of unemployed people 90 percent of the time, which corresponds to a 0.2% difference in the national unemployment rate.[19] The 300,000 is the margin of error and the 90 percent is the confidence level.

If a 90-percent confidence interval sounds like a 10 percent chance of disaster, we can trade off between the estimated error and the risk of falling outside of that error: it’s equally true to say that 99 percent of the time the unemployment figures will be accurate to within ±0.3%. This is the same thing, reported differently; we’re just widening the red line on the above charts until it covers 99 percent of the possible outcomes.
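The conversion between confidence levels can be sketched with the normal distribution, on the assumption that the sampling error is approximately bell-shaped. The function name `rescale_margin` is hypothetical; it simply scales the margin by the ratio of the two z-scores.

```python
from statistics import NormalDist

def rescale_margin(margin, from_level, to_level):
    """Convert a margin of error quoted at one confidence level to another.

    ASSUMPTION: the sampling error is approximately normal, so the margin
    is proportional to the z-score of the confidence level.
    """
    z_from = NormalDist().inv_cdf(0.5 + from_level / 2)
    z_to = NormalDist().inv_cdf(0.5 + to_level / 2)
    return margin * z_to / z_from

# ±0.2 percentage points at 90 percent confidence, restated at 99 percent
wider = rescale_margin(0.2, from_level=0.90, to_level=0.99)
print(f"±{wider:.2f} percentage points at 99 percent confidence")
```

The result is a little over 0.3 percentage points, in line with the figure quoted above.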

There is an intricate bargain being struck here. In exchange for a little fuzziness (the margin of error) and a little risk (the confidence level) we’ve reduced our work to calculate the unemployment rate by 2,000 times. This remains astonishing to me. It’s beautiful and non-obvious, and it took millennia for humanity to see it.

The Problem of Measurement Error

In practice, nothing can be measured perfectly.

A random sample has a margin of error due to sampling, but every quantification has error for one reason or another. The length of a table cannot be measured much finer than the tick marks on whatever ruler you use, and the ruler itself was created with finite precision. Every physical sensor has noise, limited resolution, calibration problems, and other unaccounted variations. Humans are never completely consistent in their categorizations, and the world is filled with special cases. And I’ve never seen a database that didn’t have a certain fraction of corrupted or missing or simply nonsensical entries, the result of glitches in increasingly complex data-generation workflows.

Error creeps in, and the data never quite matches the description on the box. Anyone who works with data has had this beaten into them by experience.

Even simple counts break down when you have to count a lot of things. We’ve all sensed that large population figures are somewhat fictitious. Are there really 536,348 people in your hometown, as the number on the “Welcome To…” sign suggests? If the sign said 540,000, we would know to treat it as a rough figure, yet far too often we’re willing to imagine that every last digit is accurate.

There are analogous difficulties with counting the number of people at a protest, the number of intravenous drug users in a city, or the number of stars in the galaxy. Even counting the number of distinct names in a large database can require complex estimation algorithms, given the constraints of distributed storage and finite memory.[20] Large counts are usually estimates, which differ from the true value by some amount.

But we gain hugely if we can say something about the accuracy of our data. Our answer to “how long is the table?” might be “52 inches, to the nearest eighth of an inch.”

Reliable data includes measures of error: how much the reported information is expected to differ from the reality it represents. There are many standard ways to report the accuracy of different kinds of data. Figures might be “accurate to the nearest quarter pound” or use more technical notation like ± and ideas like “standard error” and “confidence interval.” For a large database you could report or estimate the number of bad entries. The modern census has a second wave to estimate coverage and therefore error. In many fields it’s considered shoddy work to report a figure without giving some idea of the accuracy. Maybe we should say the same for journalism.

The idea of measurement error is the idea of quantified uncertainty. This is one of the tremendous achievements of modern thought: the recognition that knowing how much we don’t know has great value. Not all data comes with measurement errors attached. Sometimes you have to read the fine print to find out, or call someone and ask. But if you do not know and cannot reasonably guess the sources and magnitudes of possible error, then you don’t really know what the data means.

Quantification Is Representation

The world is very rich and complex. Doesn’t trying to reduce it to data lose something vital? Of course!

All quantification throws out information. It has to. That’s the point of abstraction: to strip away enough detail that it’s possible to use powerful general-purpose reasoning tools. Most things are thrown out when you go from three actual apples to “three apples” recorded in a database. We don’t know anything about the color and size of the apples, or why they are there, and maybe one of them is half rotten. If we choose “apple” as our sole unit of symbolic representation, we will be blind to everything else.

But in journalism we throw out information all the time when we select whom we talk to, what we include and exclude in our story, and what we choose to write about at all. Quantification represents the world through the systematic creation of data, a limited but powerful way to gather and summarize information.

Fortunately, quantification is neither mysterious nor fixed by nature. Quantification is always a designed process. If there is some reasonable way to quantify what we care about, a marvelous universe of analysis, representation, and prediction techniques opens up to us.

Counting is limited, but there are many things that are best known by counting.

Analysis

It may well be that several explanations remain, in which case one tries test after test until one or other of them has a convincing amount of support.
- Sherlock Holmes[21]

It’s been said that data speaks for itself. This is nonsense.

It’s true that going and looking usually beats sitting and thinking. That’s the core idea of empiricism and the point of collecting data. And it’s true that data can be revealing and insightful. Sometimes you look at a graph and say “aha!” and feel you understand the world a little better. In that moment there is the sensation that the data is speaking, that it tells a clear story.

But the data didn’t tell a story, you did. You saw a story that connects the data to the world. Are you right? Ideally, your story is thoughtfully corroborated by many sources. But if you’re going to use data as evidence, you have to understand what it does and doesn’t say.

This chapter is about how to draw true meanings from true data. There are mathematical rules which say that two plus two never equals five. There are formulas that encapsulate the logic of working with chance and cause. There are basic principles of investigation, such as testing your guesses. And there are fundamental limitations to knowledge, the cases where we must admit we can’t know the answer, at least not with the data we have.

This doesn’t mean there’s a single right answer in every case. All data analysis is really data interpretation, and relies on combining data with something else, such as previously known facts or cultural knowledge. Data, on its own, has no meaning at all. Imagine a spreadsheet with no column names. It would just be numbers, indecipherable and useless.

The necessary context enters in many different ways. Data can’t be understood without knowledge of the quantification process that created it. Statistical work usually requires assumptions tied to common knowledge: total kale consumption can’t be more than a small fraction of total food consumption, and lower cancer rates are better. But the culture and the journalist are also part of the context that creates meaning. Every society has particular worries that shape what is newsworthy, while individual journalists have specific beats and interests. Actually the context comes before the data; it tells us what data is relevant, even what questions are relevant.

Context is where subjectivity enters into data interpretation. The New York Times illustrated this with two different interpretations of the same unemployment data, describing how a Democrat and a Republican might see things.

How Democrats and Republicans might interpret the same unemployment data in different ways.[22]

But it’s not just politicians who have different perspectives. Journalists can and do disagree on the interpretation of a single number.

Headlines on October 22, 2013.[23]

Both headlines are perfectly true. The difference between them is down to whether 148,000 merits “only”: is it a big or a small number? This could also be a matter of expectations: perhaps The Wall Street Journal was hoping to see a larger increase in jobs.

This subjectivity may seem disheartening. In the sciences “subjective” is sometimes used as an insult. Subjective things are personal, dependent on who is speaking, maybe a matter of taste. Wasn’t data supposed to be objective? Wasn’t it supposed to avoid the arbitrariness of opinion and bring us closer to the truth?

Data interpretation may not be mathematical logic, but neither is it nihilist. Our interpretations must be faithful to reality. Out there in the world a policy changed crime rates, or it didn’t. The wage gap is some specific level and no other. Careful measurements show climate change is driven by human activity through particular mechanisms, or they don’t. All of these are quantitative statements that involve quantification choices, sometimes controversial ones. But once you pick a counting method, reality will see that you end up with a particular number, which is of course the point of counting. Just like a scientist, a journalist can’t make up data, ignore evidence, or condone logical fallacies. It’s equally important to know when you don’t know, when you can’t answer the question from available data.

Yet the constraints of truth leave a very wide space for interpretation. There are many stories you could write from the same set of facts, or you could decide that entirely different facts are relevant. Subjectivity is at the core of journalism, because there is no objective theory that tells us which true stories are the best. But “subjective” doesn’t necessarily mean “personal.” Culture is widely shared and people live in networks, and journalism requires a broad dose of societal knowledge. Journalists especially need to understand the common knowledge and values of the audience, even if just to challenge them. That audience is never uniform, and different people will have different concerns, experiences, and perspectives. Every time you ask yourself “what is the story here?” you are bringing the audience into your work.

Finding a story in the data will always be an act of cultural creation. But those stories must still be true! So the rest of this chapter is an introduction to three big ideas that can help draw truth from data. The first is the effect of chance, randomness, or noise, which can obscure the real relation between variables or create the appearance of a connection where none exists. The second is the nature of cause, and the situations where we can and can’t ascribe cause from the data. Above all is the idea of considering multiple explanations for the same data, rather than just accepting the first explanation that makes sense.

My goal is to give you the higher-level logic of the whole process of statistical analysis. For any particular problem you will need specific technical tools, but those choices must be guided by a larger framework.

Did the Policy Work?

In 2008 the Australian city of Newcastle, in New South Wales, had had enough of drunken assaults. The courts imposed an earlier closing time on bars in the central business district: No alcohol after 3 a.m. Now, 18 months later, you have been asked to write a story about whether or not this policy change worked. Here’s the data:

Number of nighttime assaults recorded by police in each quarter in the central business district (CBD) of Newcastle, New South Wales, where closing time was restricted to 3 a.m. Adapted from Kypri, Jones, McElduff and Barker, 2010.[24]

Our very first questions have to be about the source of the data, the quantification process. Who recorded this and how? Of course the police knew that there was a new closing time being tested. Did this influence them to count differently? Even a true reduction in assaults doesn’t necessarily mean this is a good policy. Maybe there was another way to reduce violence without cutting the evening short, or maybe there was a way to reduce violence much more.

The first step in data analysis is seeing the frame: the assumptions about how the data was collected and what it means.

But let’s assume all of those questions have been asked, and we’re down to the question of whether the policy caused a drop in assaults. In principle, there is a correct answer. Out there, in the world, the earlier closing time had some effect on the number of nighttime assaults, something between “nothing at all” and perhaps “reduced by half.” Our task is to estimate this effect quantitatively as precisely as possible (and no more precisely than that).

This data is about as clear as you’re ever likely to see outside of a textbook. We have about seven years of quarterly data for the number of nighttime assaults in the central district before the new closing time went into effect, and 18 months of data after. After the policy change the average number of incidents is a lot lower, a drop from something like 100-ish per quarter to 60-ish per quarter.

So the policy seems to have worked. But let’s spell out the logic of what we’re saying here. If you can’t express the core of your analysis in plain, non-technical language, you probably don’t understand what you’re doing. Our argument is:

1. The range of the number of incidents decreased in early 2008.

2. The earlier closing time went into effect around the same time.

3. Therefore, the earlier closing time caused the number of incidents to decrease.

Are we right? There’s no necessary reason that the drop in assaults was caused by the earlier closing time. The evidence we have is circumstantial, and any other story we could make up to explain the data might turn out to be true. That’s the core message of this chapter, and the key skill in being right: Consider other explanations.

There are common alternative explanations that are always worth considering.

First, chance. Sheer luck could be fooling us. The actual number of assaults per quarter is shaped by circumstantial factors that we can’t hope to know. Who can say why someone threw a punch, or didn’t? And we have only six data points from after the new policy went into effect. Could we just be seeing a lucky roll of the die?

Second, correlation. The decrease could be related to the earlier closing time without being caused by it. Perhaps the police stepped up patrols to enforce the new law, and it’s this increased presence that is reducing crime, not the new closing time itself.

Third, everything else. The change could be caused by something that has never occurred to us. Maybe there was a change in some other sort of policy that has a large effect on nightlife. Maybe crime was falling all over the country at the same time.

We’ll tackle these one at a time. To get there, we need to tour through some of the most fundamental and profound ideas of statistical analysis.

Accounting for Chance

It’s very tempting to interpret something as meaningful when it could just as easily be a coincidence, especially if it makes a good story. But dumb luck is always in the running as an explanation for your data. To try to untangle chance from other factors, we can estimate the probability of sheer coincidence.

Our nighttime assaults data shows generous variation. Before the change in closing hours the number of assaults ranged from 60-ish to 130-ish. We say this variation is random, meaning that we can’t ever hope to know the circumstances that cause a particular fight on a particular night, and it is precisely this randomness that complicates our analysis.[xii] The less data you have, the more chance is a factor and the easier it is to be fooled. Suppose we only had two quarters of data after the change:

Number of nighttime assaults, with only two data points after closing time was restricted to 3 a.m. Adapted from Kypri et al.[25]

If you looked at just this data, you might conclude that the new closing time had no effect. The new points are pretty much in line with the data from the previous four quarters. If anything, it looks like there was a downward shift in the number of assaults a year before the policy ever went into effect! But having seen the additional data, we know that the two points here are at the high end of a new lower range. It’s just chance that makes this truncated data look like nothing happened.

If we can be fooled by two chance data points, can we be fooled by six? Certainly, but less probably. How much less?

It takes a while to build up an intuition about the effects of chance. From working with data and models, you eventually get a sense of what randomness looks like, and therefore what it doesn’t look like and how much data you need to feel sure about your conclusions. It’s well worth getting this sense in your bones. But the great advantage of statistical theory is the ability to quantify chance. “What are the odds that it’s just a coincidence?” is not a rhetorical question. It asks for a numeric answer.


Counting Possible Worlds

You probably use words like “odds,” “chance,” “frequency,” and “probability” all the time to refer to uncertain events. But before we can go any further we need to get precise about what these words mean. You have to get the basics right or smart people in your audience will make fun of you, and besides you won’t be able to calculate anything correctly.

These simple ideas are no less profound for being old, and they really only emerged in the late 1600s.[xiii] Even if you’ve been through this before, perhaps I can offer a new perspective. Statistics counts possible worlds.

Probability is a way of reasoning about events that we can’t observe. Maybe we can’t see what’s happening because of practical problems: what’s the temperature at the center of the sun? But quite commonly, we will use probability to talk about potential worlds: what would happen if we choose this policy?[xiv] The central insight of probability is that in many of these situations you know more than nothing.

Perhaps you don’t know what the next roll of the die will be, but you do know that all possibilities will occur in equal proportions. Or you might know that your friend usually orders the lemon tart at your weekly dinner date, and less commonly the blueberry cheesecake. You can use numbers to express these ideas. A probability of 0 means “impossible” while a probability of 1 means “certain,” and the probabilities of all the possible outcomes have to add up to 1.

Probabilities are like a percentage in that they are proportions, not counts, and when someone says “percentage chance” they usually mean probability times 100. But it’s often more intuitive to think about probabilities as frequencies, actual counts of different outcomes. Suppose that over the next five dinners with your friend you would expect her to order two blueberry cheesecakes and three lemon tarts. This hasn’t actually happened yet so we’re not counting actual desserts, but rather the desserts we expect; probability is a language for talking about our uncertainty.

The counts here are frequencies. Probabilities are just the ratio of one type of event to all events.


The probability that something happens is usually written p(something). In this case p(cake) = 2/5 = 0.4, but like a variable in an equation, you may or may not know the value of your p(something). It may stand in for a number that someone has previously measured or computed, or it may be what you’re trying to work out.

The odds are a slightly different way of talking about the same proportion.

The odds are defined as the number of events we are counting divided by the number we are not counting. In gambling the odds are the number of times you win divided by the number of times you don’t. The odds of cake here are 2/3 or about 0.67, but we usually report odds by giving the numerator and the denominator separately: the odds are 2 to 3. You can convert odds to probability by dividing the first number by the sum of the two: 2 to 3 odds is a probability of 2/(2+3) = 0.4. Odds of 1 to 1 mean a probability of 1/(1+1) = 1/2, or a 50/50 chance.
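If it helps to see the two formulas side by side, here is a minimal sketch of the conversion in Python. The function names are my own, chosen for illustration, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

def odds_to_probability(in_favor, against):
    """Odds of 'in_favor to against' become in_favor / (in_favor + against)."""
    return Fraction(in_favor, in_favor + against)

def probability_to_odds(p):
    """A probability p becomes the odds ratio p / (1 - p)."""
    p = Fraction(p)
    return p / (1 - p)

p_cake = odds_to_probability(2, 3)
print(p_cake, float(p_cake))                 # 2/5 0.4
print(odds_to_probability(1, 1))             # 1/2
print(probability_to_odds(Fraction(2, 5)))   # 2/3
```

Going in either direction is a one-line calculation, which is exactly why it is easy to apply the wrong one without noticing.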

Although “odds” and “probability” are both numeric measurements of chance, they are different formulas and if you confuse them you will get the wrong answer. Don’t be that journalist. (You’re also welcome to correct people when they use the wrong words, but remember: pedants die alone.)

We can do some nifty things with simple probabilities. How many cakes do you expect your friend to order over the next 20 dinners? This is just p(cake) × 20 = 0.4 × 20 = 8. You can think of 0.4 as the average number of cakes she orders per dinner. Of course there is randomness here; she actually orders either zero or one cake each time, and over the course of 20 dinners she might order 7 or 9 or 17 cakes, but 8 will be the most common number. (Because there are two possible dessert choices, you get a binomial distribution just like the sampling distribution from the last chapter.)
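To check the claim that 8 is the most common count, we can write out the binomial distribution directly. This is a sketch that assumes each dinner is an independent choice with p(cake) = 0.4; the helper name `binomial_pmf` is mine.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k cake orders in n independent dinners."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

N_DINNERS, P_CAKE = 20, 0.4

expected_cakes = N_DINNERS * P_CAKE
pmf = [binomial_pmf(k, N_DINNERS, P_CAKE) for k in range(N_DINNERS + 1)]
most_common = max(range(N_DINNERS + 1), key=lambda k: pmf[k])

print(expected_cakes)   # 8.0
print(most_common)      # 8
for k in (7, 8, 9, 17):
    print(k, pmf[k])    # how likely each of the counts mentioned above is
```

Seventeen cakes is possible, but the distribution puts it far out in the tail, while 7 and 9 sit right next to the peak.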


Quite often we will need to count how frequently multiple events occur together. What is the probability that your friend orders cheesecake at the next two dinners? Let’s draw every possible combination of her first and second dessert orders.

For her first dinner she orders cake 2 out of 5 times. After each of those, she orders cake again 2 out of 5 times. Hence there are 2 × 2 = 4 possible worlds where you get two cake orders in a row. Since there are 5 × 5 = 25 possibilities in total, the probability is 4/25 or 0.16.

Or, we could just multiply p(cake) × p(cake) = 0.4 × 0.4 = 0.16. The definition of probability divides out the total number of cases so that probabilities are always between 0 and 1, which lets us avoid the tedious bookkeeping of counting cases directly when all we want is the final proportion. Multiplication is how you work out the probability that event A and event B both happen when the events in question are independent, that is, one doesn’t affect the other. Whether or not this is true is a question your data cannot answer. A coin doesn’t care if it came up heads or tails last time, but maybe your friend will get tired of too many cakes in a row.
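The bookkeeping of counting worlds can be handed to a computer. The sketch below lists all 25 equally likely two-dinner worlds and compares the count with the multiplication rule; the variable names are mine, and the two dinners are assumed to be independent.

```python
from fractions import Fraction
from itertools import product

# Five equally likely outcomes for one dinner: two cakes and three tarts.
one_dinner = ["cake"] * 2 + ["tart"] * 3

# Every possible world for two dinners in a row.
worlds = list(product(one_dinner, repeat=2))
two_cakes = [w for w in worlds if w == ("cake", "cake")]

p_counted = Fraction(len(two_cakes), len(worlds))
p_multiplied = Fraction(2, 5) * Fraction(2, 5)

print(len(two_cakes), len(worlds))   # 4 25
print(p_counted, p_multiplied)       # 4/25 4/25
```

Both routes give the same answer because multiplying probabilities is just counting branches with the totals already divided out.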

We can apply the multiplication rule to our assaults data. Suppose we can work out the probability that we’ll see a quarter with 80 or fewer assaults just by chance, even if the earlier closing time did nothing. Call this p(low). Then the probability that we’ll see two low quarters in a row is p(low) × p(low), the probability of seeing three low quarters in a row is p(low) × p(low) × p(low), and so on.

In practice you don’t work out probabilities by drawing trees, just as you don’t work out the margin of error by drawing pictures of samples. Still, I love thinking in terms of trees of possibilities because it makes plain what we are doing with probability arithmetic. Each branch is a possible course through history, and we are assigning probabilities by counting the branches of different types. All of statistics is based on the idea of counting possibilities.


Arguing From the Odds

We can use the logic of counting cases to work out the probability of an unlikely event happening by chance. In the winter of 1976 the United States embarked on a nationwide flu vaccination program, responding to fears of an H1N1 virus epidemic (a.k.a. swine flu). Millions of people lined up across the country to get vaccinated. But some of them got sick after, or even died. The New York Times wrote an editorial:

It is disconcerting that three elderly people in one clinic in Pittsburgh, all vaccinated within the same hour, should die within a few hours thereafter. This tragedy could occur by chance, but the fact remains that it is extremely improbable that such a group of deaths should take place in such a peculiar cluster by pure coincidence.[26]

But is it really “extremely improbable?” Nate Silver has estimated the odds:

Although this logic is superficially persuasive, it suffers from a common statistical fallacy. The fallacy is that, although the odds of three particular elderly people dying on the same particular day after having been vaccinated at the same particular clinic are surely fairly long, the odds that some group of three elderly people would die at some clinic on some day are much shorter.

Assuming that about 40 percent of elderly Americans were vaccinated within the first 11 days of the program, then about 9 million people aged 65 and older would have received the vaccine in early October 1976. Assuming that there were 5,000 clinics nationwide, this would have been 164 vaccinations per clinic per day. A person aged 65 or older has about a 1-in-7,000 chance of dying on any particular day; the odds of at least three such people dying on the same day from among a group of 164 patients are indeed very long, about 480,000 to one against. However, under our assumptions, there were 55,000 opportunities for this “extremely improbable” event to occur: 5,000 clinics, multiplied by 11 days. The odds of this coincidence occurring somewhere in America, therefore, were much shorter, only about 8 to 1 against.[27]

This is a mouthful. It doesn’t help that Silver is switching between probabilities (“a 1-in-7,000 chance”) and odds (“480,000 to one”). But it’s just a bunch of probability arithmetic. The only part that isn’t simple multiplication is “the odds of at least three such people dying.” In practice your calculator will have some command to solve these sorts of counting problems. The more fundamental insight is that you can multiply the probability of three people dying on the same day at the same clinic by the number of opportunities where it could happen to work out how often it should happen.
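Here is one way the arithmetic in that passage might be retraced in Python. The inputs are the figures Silver states; treating each patient and each clinic-day as independent, and using a binomial count for “at least three,” are my assumptions about how the numbers fit together, not a record of his actual method.

```python
from math import comb

P_DEATH = 1 / 7000        # daily chance of death for a person aged 65 or older
PATIENTS = 164            # vaccinations per clinic per day
CLINIC_DAYS = 5000 * 11   # 5,000 clinics over 11 days = 55,000 opportunities

def binomial_pmf(k, n, p):
    """Probability of exactly k deaths among n independent patients."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def odds_against(p):
    """Express a probability p as 'x to 1 against'."""
    return (1 - p) / p

# "At least three" is everything except zero, one, or two deaths.
p_cluster = 1 - sum(binomial_pmf(k, PATIENTS, P_DEATH) for k in range(3))

# Chance that the cluster shows up in at least one of the clinic-days.
p_somewhere = 1 - (1 - p_cluster) ** CLINIC_DAYS

print(odds_against(p_cluster))     # odds against at one clinic on one day
print(odds_against(p_somewhere))   # odds against anywhere in the program
print(p_cluster * CLINIC_DAYS)     # expected number of such clusters
```

Under these assumptions the results should land close to the “480,000 to one” and “8 to 1” figures in the quote, which is all a rough estimate like this needs to do.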


To be sure, this can only be a rough estimate; there is a big pile of assumptions here, such as the assumption that death rates don’t vary by place and time. But the point of this exercise is not to nail down the decimals. We’re asking whether or not we should believe that chance is a good explanation for seeing three post-vaccination deaths in one day, and we only need an order-of-magnitude estimate for that. Rough estimates can be incredibly useful for checking your story, and there’s a trove of practical lore devoted to them.[28]

Odds of “8 to 1 against” correspond to a probability of 1/9, or an 11 percent chance that we’d see three people from the same clinic die on the same day. Are these particularly long odds? This question is hard to answer on its own.

The less likely it is that something can occur by chance, the more likely it is that something other than chance is the right explanation. This sensible statement is no less profound when you think it through. This idea emerged in the 1600s when the first modern statisticians asked questions about games of chance. If you flip a coin 10 times and get 10 heads, does that mean the coin is rigged or are you just lucky? The less likely it is to get 10 heads in a row from a fair coin, the more likely the coin is a fake. This principle remains fundamental to the disentangling of cause and chance.

Coins and cards are inherently mathematical. Random deaths are a sort of lottery, where you can multiply together the probabilities of the parts. It can be a little harder to see how to calculate the probabilities in more complex cases. The key is to find some way of quantifying the randomness in the problem. One of the earliest and most famous examples of accounting for chance in a sophisticated way concerns a fake signature, millions of dollars, and a vicious feud of the American aristocracy.

In 1865, Sylvia Ann Howland of Massachusetts died and left behind a 2,025,000-dollar estate (that would be about 50 million dollars today). But the will was disputed, there was a lawsuit, and the defense argued that the signature was traced from another document. To support this argument, the mathematician Benjamin Peirce was hired to prove that the original signature could not match the disputed signature so closely purely by chance. The signatures looked like this:


A known genuine and two possibly forged signatures in the Howland will case. From Meier and Zabell, 1980.[29]

To work out the probability of these two signatures matching by chance, Peirce first worked out how often a single stroke would match between two authentic signatures. He collected 42 signatures from other documents, all of them thought to be genuine. Then he instructed his son, Charles Sanders Peirce, to superimpose each of the 861 possible pairs of these 42 signatures and count how many of the 30 downward-moving strokes aligned in position and length. Charles found that the same stroke in two different signatures matched only one-fifth of the time. This is the key step of quantifying random variation, which Peirce did by counting the coincidences between signatures produced in the wild.

But every stroke of every letter matched exactly between the original and disputed signatures. The elder Peirce wanted to show just how unlikely it was that this could happen by chance, so he assumed that every stroke was made independently, which allowed him to use the multiplication rule for probabilities. Since there are 30 strokes in the signature and a 1/5 chance of any single stroke matching, he argued that the positions of the strokes of two genuine signatures should match by chance only once in 5 × 5 × … × 5 times (30 factors of 5), that is, once in $5^{30}$. This is a fantastically small probability, a 0.0000000000000000001 percent chance of a random match. According to this calculation, if you signed your name like Mrs. Howland and did it a billion times you would never see the same signature twice; one in a billion would be a much healthier 0.0000001 percent chance. A modern analysis which does not assume independence of each stroke gives a probability several orders of magnitude more likely, but still extraordinarily unlikely.[xv]
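Peirce’s numbers are easy to check. This sketch recomputes the count of signature pairs and the chance of 30 independent stroke matches, using exact fractions; the variable names are mine, and the independence assumption is Peirce’s, which the modern reanalysis mentioned above does not share.

```python
from fractions import Fraction
from math import comb

N_SIGNATURES = 42
N_STROKES = 30
P_STROKE_MATCH = Fraction(1, 5)   # one stroke matched one-fifth of the time

pairs = comb(N_SIGNATURES, 2)                  # ways to pick 2 of 42 signatures
p_all_match = P_STROKE_MATCH ** N_STROKES      # multiplication rule, 30 times
percent_chance = float(p_all_match) * 100

print(pairs)              # 861
print(p_all_match)        # 1/931322574615478515625
print(percent_chance)     # about 1e-19 percent
```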

It seemed much more likely that the signature was forged by Hetty Robinson, Sylvia Ann Howland’s niece who was contesting the will. Robinson had access to the original documents and stood to gain millions of dollars by tracing Mrs. Howland’s signature on an extra page spelling out favorable revisions.

I admit I’m disappointed that the case was ultimately decided on other grounds, rendering this analytical gem legally irrelevant. But the event was a milestone in the practical use of statistics. Statistics was mostly applied to physics and gambling at that time, never anything as qualitative as a signature. The trick here was to find a useful way of quantifying the variations from case to case. Charles Sanders Peirce went on to become one of the most famous nineteenth-century scientists and philosophers, contributing to the invention of the randomized controlled experiment and the philosophical approach known as pragmatism.[30]

The probability that you would see data like yours purely by chance is known as the p-value in statistics, and there is a popular theory of statistical testing based on it. First, you need to choose a test that defines whether some data is “like yours.” Peirce said a pair of signatures is “like” the two signatures on the will if all 30 strokes match. Then imagine producing endless random data, like scribbling out countless signatures, or monkeys banging on typewriters. Peirce couldn’t get the deceased Howland to write out new pairs of signatures, so he compared all combinations of all existing known genuine signatures. The p-value counts how often this random data passes the test of looking like your data, that is, the data you suspect is not random.
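The recipe in that paragraph (pick a test, generate random data, count how often it passes) translates almost word for word into code. As a sketch, here it is applied to the earlier coin question, where “data like yours” means 10 heads out of 10 flips. The function names, the seed, and the number of trials are all arbitrary choices of mine.

```python
import random

def p_value_by_simulation(make_random_data, looks_like_ours, trials, rng):
    """Fraction of purely random datasets that pass the 'looks like ours' test."""
    hits = sum(looks_like_ours(make_random_data(rng)) for _ in range(trials))
    return hits / trials

def ten_fair_flips(rng):
    """Random data: ten flips of a fair coin, True meaning heads."""
    return [rng.random() < 0.5 for _ in range(10)]

def all_heads(flips):
    """The test: does this data look like our suspicious run of ten heads?"""
    return all(flips)

rng = random.Random(0)   # fixed seed so the run is repeatable
p_simulated = p_value_by_simulation(ten_fair_flips, all_heads, 200_000, rng)
p_exact = 0.5 ** 10      # the multiplication rule gives 1/1024

print(p_simulated, p_exact)
```

For a coin the exact answer is available, so the simulation is only a check. The same counting approach still works when there is no tidy formula, which was Peirce’s situation with the signatures.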

There’s a convention of saying that your data is statistically significant if p < 0.05, that is, if there is a 5 percent probability (or less) that you’d see data like yours purely by chance. Scientists have used this 5 percent chance of seeing your data randomly as the minimum reasonable threshold to argue that a particular coincidence is unlikely to be luck, but they much prefer a 1 percent or 0.1 percent threshold for the stronger argument it makes.[31] But be warned: No mathematical procedure can turn uncertainty into truth! We can only find different ways of talking about the strength of the evidence. The right threshold to declare something “significant” depends on how you feel about the relative risks of false negatives and false positives for your particular case, but the 5 percent false positive threshold is a standard definition that helps people communicate the results of their analyses.

Let’s use this p < 0.05 standard to help us evaluate whether the 1976 flu vaccine was dangerous. By this convention, an 11 percent chance of seeing three people randomly die on the same day does not count as evidence of a problem with the vaccine; you could say the occurrence of these deaths is not statistically significant. That is, because there is a greater than 5 percent chance that we’d see data like ours (three people dying) even if the vaccine is fine, it’s not a good bet to assume that these deaths were caused by a toxic vaccine. But this does not mean there is an 11 percent chance that the vaccine is safe. We haven’t yet said anything at all about the vaccine; so far we’ve only talked about the odds of natural death.

Really the question we need to ask is comparative: Is it more likely that the vaccine is harmful, or that the three deaths were just a fluke? And how much more likely? Is there greater or less than an 11 percent chance the vaccine is toxic and no one noticed during earlier testing? In the case of the Howland will, we found minuscule odds that two signatures could end up identical by accident. But what are the odds that Mrs. Howland’s niece forged the will? A more complete theory of statistics tests multiple alternatives.


Statistical Inference

There is a completely general method of accounting for chance which forms the basis of modern statistical reasoning. Inference is the process of combining existing knowledge to get new conclusions, something we do every day. Statistical inference adds the element of uncertainty, where both our information and our conclusions have an element of chance.

The propositional logic of the Greeks gave us a template for reasoning when every variable is exactly true or false: “If it rains, the grass will get wet. The grass is not wet. Therefore it did not rain today.” The theory of statistical inference extends this to uncertain information and uncertain answers: “There was a 40 percent chance of rain today. It’s hard to say from just looking out my window, but I’m 70 percent sure the grass is dry. What’s the probability that it rained today?”

The most comprehensive modern theory is usually called Bayesian statistics after its roots in Reverend Bayes’s theorem of 1763. But the practical method was only fully developed in the twentieth century with the advent of modern computing. If you’ve never seen this sort of thing before, it’s unlikely that this little introduction will prepare you to do your own analyses. We can’t cover all of Bayesian statistics in a few pages, and anyway there are books on that.[xvi] Instead, I’ll walk through a specific Bayesian method, a general way to answer multiple-choice questions when the answer is obscured by randomness. My purpose is to show the basic logic of the process, and to show that this logic is commonsensical and understandable. Don’t let statistics be mysterious to you!

Bayesian statistics works by asking: What hypothetical world is most likely to produce the data we have? And how much more likely is it to do so than the alternatives? The possible “worlds” are captured by statistical models, little simulations of hypothetical realities that produce fake data. Then we compare the fake data to the real data to decide which model most closely matches reality.

With the multiple-choice method in this chapter you can answer questions like “how likely is it that the average number of assaults per quarter really decreased after the earlier closing time?” Or “if this poll has Nunez leading Jones by 3 percent but it has a 2 percent margin of error, what are the chances that Nunez is actually the one ahead?” Or “could the twentieth century’s upward global temperature trend be just a fluke, historically speaking?”

We’ll work through a small example that has the same shape as our assaults versus closing time policy question. Suppose there is a dangerous intersection in your city. Not long ago there were nine accidents in one year! But that was before the city installed a traffic light. Since the stoplight was installed there have been many fewer accidents.


Accident data surely involves many seemingly random circumstances. Maybe the weather was bad. Maybe a heartbroken driver was distracted by a song that reminded them of their ex. A butterfly flaps its wings, etc.[xvii] Nonetheless, it is indisputably true that there were fewer accidents after the stoplight was installed.

But did the stoplight actually reduce accidents? We might suspect that a proper stoplight will cut accidents in half, but we have to regard this possibility as a guess, so we say it’s a hypothesis until we find some way to prove it. We’re going to compare the following hypotheses:

1. The stoplight was effective in reducing accidents by half.

2. The stoplight did nothing, meaning that the observed decline in accidents is just luck.

The next thing we need is a statistical model for each hypothesis. A model is a toy version of the world that we use for reasoning. It incorporates all our background knowledge and assumptions, encapsulating whatever we might already know about our problem. Silver used a simple model, based on the odds of any given person dying on any given day, to estimate the odds of three people dying on the same day at any of 5,000 clinics. Peirce created a model based on the stroke positions of 42 signatures that were known to be genuine. A model is by definition a fake. It’s not nearly as sophisticated as reality. But it can be useful if it represents reality in the right way. Creating a model is a sort of quantification step, where we encode our beliefs about the world into mathematical language.


For our purposes a model is a way to generate fake data, imagined histories of the world that never occurred. We’ll need two assumptions to build a simple model of our intersection. We’ll assume that the same number of cars pass each day, and we’ll pick the number based on the historical data we have. We’ll further assume that there is some percentage chance of each car getting into an accident as it passes through, and again we’ll use historical data, pre-stoplight, to guess at the proper percentage.

With these two numbers in hand you can imagine writing a small piece of code to simulate the intersection. As each simulated car goes into the simulated intersection we can flip a simulated coin to determine whether to count an accident. We calibrate the “coin” so the cars crash at the proper percentage. This is a reasonable model if we are willing to assume that car accidents are independent: there might have been an accident at this intersection a year or an hour ago but that doesn’t change the odds that you are about to have an accident.[xviii]
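That small piece of code might look something like the sketch below. Every number in it is a stand-in: the chapter gives neither traffic counts nor the second pre-stoplight year, so `PRE_STOPLIGHT` and `CARS_PER_YEAR` are invented for illustration (the latter kept unrealistically small so the loop runs quickly), and the function names are mine.

```python
import random

# Hypothetical inputs, not data from the real intersection.
PRE_STOPLIGHT = [9, 8]   # accidents in the two pre-stoplight years (assumed)
CARS_PER_YEAR = 500      # cars through the intersection per year (assumed)

def simulate_year(cars, p_accident, rng):
    """Flip one calibrated 'coin' per car and count the accidents."""
    return sum(rng.random() < p_accident for _ in range(cars))

def simulate_history(rng, years_after=3):
    """Real pre-stoplight years followed by simulated years with no stoplight effect."""
    average = sum(PRE_STOPLIGHT) / len(PRE_STOPLIGHT)
    p_accident = average / CARS_PER_YEAR   # calibrate the coin to the historical rate
    simulated = [simulate_year(CARS_PER_YEAR, p_accident, rng)
                 for _ in range(years_after)]
    return PRE_STOPLIGHT + simulated

rng = random.Random(1)   # fixed seed so the alternate histories are repeatable
for scenario in range(1, 10):
    print(scenario, simulate_history(rng))
```

Each printed row is one imagined history: two observed years, then three years of pure chance at the pre-stoplight accident rate.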

By setting up the simulation to produce the same average accident rate as we saw pre-stoplight, we’ve built a model of the intersection without the stoplight that we hope matches the real world. We can use this model to get a feel for the range of scenarios that chance can produce by running the simulation many times, like this:


The first two years in each of these charts are just the original data, pre-stoplight. The last three years have been generated by simulation. In some of these alternate histories the number of accidents decreased relative to the pre-stoplight years, and in others the pattern was increasing or mixed, all purely by chance. In order to compare models, we first need to pick a more precise definition of “decline.” So let’s say that the accidents “declined” if all the post-stoplight years show fewer accidents than any of the pre-stoplight years, just like the real data from the actual intersection. This is a somewhat arbitrary criterion, but your choice determines exactly which hypotheses you are testing. Just as our simulation expresses the world in code, our test criterion expresses the hypotheses mathematically. By our chosen test, scenarios 4, 6, and 7 show a decrease in the accident rate. We are counting the branches of a tree of possibilities once more.

The key number is how often we see the effect without the alleged cause, just like the vaccine deaths and Howland will case. None of these alternate histories include a stoplight, yet we see a decline after the second year in 3/9 cases, which is a probability of 0.33. This makes the “chance decline” theory pretty plausible. A probability of 0.33 is a 33 percent chance, which may not seem “high” compared to something that happens 90 percent of the time, but if you’re rolling dice you’re going to see anything that happens 33 percent of the time an awful lot.

This doesn’t make the “chance decline” hypothesis true. Or false. It especially does not mean that the chance decline theory has a 33 percent chance of being true. We assumed that “chance decline” was true when we constructed the simulation. In the language of conditional probability, we have computed p(data | hypothesis), which is read “the probability of the data given the hypothesis.” What we really want to know is p(hypothesis | data), the probability that the hypothesis is true given the data. The distinction is kind of brain bending, I admit, but the key is to keep track of which way the deduction goes.

As we saw in the last section, the more likely it is that your data was produced by chance, the less likely it was produced by something else. But to finish our analysis we need a comparison. We haven’t yet said anything at all about the evidence for the “stoplight worked” theory.

First we need a model of a working stoplight. If we believe that a working stoplight should cut the number of accidents in half in an intersection like this, then we can change our simulation to produce 50 percent fewer accidents. This is an arbitrary number; a more sophisticated analysis would test and compare many possible numerical values for the reduction in accidents. Here’s the result of simulating a 50 percent effective stoplight many times:


Again, each of these charts is a simulated alternate history. The first two years of data on each chart are our real data and the last three years are synthetic. This time the simulation produces half as many accidents on average for the last three years, because that’s how effective we believe the stoplight should be. By our criterion that every post-stoplight year should be lower than every pre-stoplight year, there’s a reduction in accidents in simulations 1, 2, 4, 5, 6, 7, and 9. This is 7 out of 9 scenarios declining, or a 7/9 = 0.78 probability that we’d see a decline like the one we actually saw, if the stoplight reduced the overall number of accidents by half.

This is good evidence for the “stoplight cut accidents in half” hypothesis. But the probability of seeing this data by chance is 0.33, which is also pretty good. This is not a situation like Mrs. Howland’s will where the odds of one hypothesis were minuscule (identical signature by chance) while the odds of the other hypothesis were good (forged signature to get millions of dollars).

Finally we arrive at a numerical comparison of two hypotheses in the light of chance effects. The key figure is the ratio of the probabilities that each model generates data like the data actually observed. This is called the likelihood ratio or Bayes factor, and you can think of it as the odds in favor of one model as compared to another. The key idea of comparing multiple models was fleshed out in the early twentieth century by figures such as R. A. Fisher[32] and Harold Jeffreys.[33]


The probability that “stoplight cut accidents in half” could generate our declining data is 7/9 = 0.78 while the probability that “chance decline” accounts for the data is 3/9 = 0.33, so the Bayes factor is (7/9)/(3/9) = 7/3, or about 2.3. This means that the odds of the “stoplight worked” model generating the observed data, when compared to the “chance decline” model, are 2.3 to 1 in favor.
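To make the whole comparison concrete, here is a sketch that runs both models many times, applies the “every post-stoplight year below every pre-stoplight year” test, and takes the ratio. As before, `PRE_STOPLIGHT`, `CARS_PER_YEAR`, the seed, and the number of runs are invented for illustration, so the probabilities it produces will not match the 0.33 and 0.78 counted from the nine charts above.

```python
import random

# Hypothetical inputs, not data from the real intersection.
PRE_STOPLIGHT = [9, 8]   # accidents in the two pre-stoplight years (assumed)
CARS_PER_YEAR = 500      # assumed, kept small so the simulation runs quickly
YEARS_AFTER = 3
RUNS = 1000              # alternate histories per model

def simulate_post_years(effect, rng):
    """effect=1.0: the light did nothing; effect=0.5: it halved the crash chance."""
    average = sum(PRE_STOPLIGHT) / len(PRE_STOPLIGHT)
    p_accident = effect * average / CARS_PER_YEAR
    return [sum(rng.random() < p_accident for _ in range(CARS_PER_YEAR))
            for _ in range(YEARS_AFTER)]

def declined(post_years):
    """The chapter's criterion: every post year is below every pre year."""
    return max(post_years) < min(PRE_STOPLIGHT)

def p_decline(effect, rng):
    """Estimate p(data | hypothesis) by counting histories that show a decline."""
    hits = sum(declined(simulate_post_years(effect, rng)) for _ in range(RUNS))
    return hits / RUNS

rng = random.Random(42)
p_chance = p_decline(1.0, rng)   # "chance decline" model
p_half = p_decline(0.5, rng)     # "stoplight cut accidents in half" model
bayes_factor = p_half / p_chance

print(p_chance, p_half, bayes_factor)
```

The structure is the point here, not the output: two models, one shared test, and a ratio of how often each model passes it.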

This doesn’t make the “stoplight cut accidents in half” story true. But it definitely seems more likely.

These 2.3 to 1 odds are middling. Converting the odds to a probability, and assuming we had no other reason to favor either hypothesis beforehand, that’s a 2.3/(2.3+1) = 70 percent chance the stoplight worked. That means if you write a story which says it did work, there’s a 30 percent chance you’re wrong. In other situations you might have a 90 percent or 99 percent or even 99.9 percent chance of guessing correctly. But there can be no fixed scale for evaluating the odds, because it depends on what’s at stake. Would 2.3 to 1 odds be good enough for you to run a story that might look naive later? What if that story convinced the city government to spend millions on stoplights that didn’t work? What if your story convinced the city government not to spend millions on stoplights that did work, and could have saved lives?

Even so, “stoplight worked” is a better story than “chance decline.” A better story than either would be “stoplight probably worked.” Journalists, like most people, tend to be uncomfortable with intermediate probability values. A 0 percent or 100 percent chance is easy to understand. A 50/50 chance is also easy: You know essentially nothing about which alternative is better. It’s harder to know what to do with the 70/30 chance of our 2.3 to 1 odds. But if that’s your best knowledge, it’s what you must say.

In real work we also need to look at more than the data from just one stoplight. We should be talking to other sources, looking at other data sets, collecting all sorts of other information about the problem. Fortunately there is a natural way to incorporate other knowledge in the form of prior odds, which you can think of as the odds that the stoplight worked given all other evidence except your data. This comes out in the mathematical derivation of the method, which says we need to multiply our Bayes factor of 2.3 to 1 by the prior odds to get a final estimate.

Maybe stoplight effectiveness data from other cities shows that stoplights usually do reduce accidents but seem to fail about a fifth of the time, so you pick your prior odds at 4 to 1. Multiplying by your 2.3 to 1 strengthens your final odds to about 9 to 1. The logic here is: stoplights in other cities seem to work, and this one seems to work too, so the totality of evidence is stronger than the data from just this one stoplight.

Or maybe you have talked to an expert who tells you that stoplights usually only work in large and complex highway intersections, not the quiet little residential intersection we’re looking at, so you pick prior odds of 1 to 5, which could also be written 0.2 to 1. In this case even our very plausible data can’t overwhelm this strong negative evidence, and the final odds are 2.3 × 0.2 = 0.46 to 1, meaning that it’s more than twice as likely that the stoplight didn’t work. The logic here is: most stoplights at this kind of intersection don’t work, and this undermines the evidence from this one stoplight, which leads us to believe that the observed decline is more likely than not just due to chance.
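The prior-odds arithmetic is short enough to lay out in full. This sketch takes the Bayes factor of 2.3 and the two sets of prior odds from the text, adds even prior odds for comparison, and converts each result to a probability; the function name and the labels are mine.

```python
def odds_to_probability(odds):
    """Odds of 'odds to 1' in favor become a probability."""
    return odds / (1 + odds)

BAYES_FACTOR = 2.3   # from the stoplight example above

prior_odds = {
    "no other evidence (1 to 1)": 1.0,
    "other cities (4 to 1)": 4.0,
    "skeptical expert (1 to 5)": 0.2,
}

posterior_odds = {label: BAYES_FACTOR * prior for label, prior in prior_odds.items()}

for label, odds in posterior_odds.items():
    chance = odds_to_probability(odds)
    print(f"{label}: final odds {odds:.2f} to 1, {chance:.0%} chance it worked")
```

The same data supports anything from roughly a one-in-three chance to roughly a nine-in-ten chance that the stoplight worked, depending entirely on what else you know.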

Multiplying by the prior is mathematically sound, yet it’s often unclear how to put probabilities on available evidence. If the mayor of Detroit tells you she swears by stoplights in her city, what does this say about the odds of stoplights working versus not working as a numeric value? There is no escape from judgment. But even very rough estimates may be usefully combined this way. If nothing else, the existence of the prior in statistical formulas helpfully reminds us to consult all other sources!

There is a lot more to say about this method of comparing the likelihood that different models generated your data. The method here only applies to multiple-choice questions, whereas real work often estimates a parameter: how much did the stoplight reduce accidents? And we’ve barely touched on modeling, especially the troubling possibility that all of your models are such poor representations of reality that the calculations are meaningless.[xix] But the fundamental logic of comparing how often different possibilities would produce your observed data carries through to the most complex analyses. I hope this example gives the flavor of how a single unifying framework has been used to solve problems in medicine, cryptography, ballistics, insurance, and just about every other human activity.[34] Bayesian statistics is something remarkable, and I find its wide success incredible, unlikely, and almost shockingly too good to be true. You can always start from the general framework and work your way toward the details of your problem. This is sometimes more work, but it is the antidote to staring at equations and wondering if they apply.


What Would Have Happened Anyway?

Let’s suppose we’ve ruled out luck as an explanation for our data. Suppose we have inferred that something in the assaults data really did change around the time the new closing-time policy came into effect. Attributing this change to the new closing times is another matter entirely.

It would be easy to determine the true effects of the new policy if we knew how many assaults we would have seen had the policy never gone into effect. To say that A caused B is to say that B would not have happened without A. But we only have data with the policy change. Every statement about cause is really a statement about the way the world would have been without that cause, a counterfactual statement. This is one reason why causation is so tricky: it requires reasoning about imaginary worlds that we can never observe directly.

This problem can only really be solved with a time machine. We could go back in time, prevent the new closing time from taking effect, then wait to collect equivalent data in this divergent universe. Lacking a time machine, we’ll once again use a model, a way of describing the alternate histories we can’t ever observe directly.

If we had two identical copies of New South Wales, we could just change the policy in one copy and not the other, and compare the results. This is the logic behind the controlled experiment where you give a new drug to the treatment group and not to the control group. Journalists don’t normally get to design experiments, and anyway there are never two identical cities to experiment on. But we could make comparisons with similar cities or neighborhoods.

Just this sort of comparison casts great doubt on an attempt to reduce gun violence in Richmond, Virginia, in the late 1990s. Project Exile aimed to reduce the number of murders by increasing the punishment for illegal gun possession (such as when a previously convicted felon is found to be carrying a gun). The minimum sentence was effectively increased from five to 10 years by shifting all such cases from state to federal courts.

At first glance, it worked.


Gun homicides per 100,000 residents in Richmond, Virginia, before and after Project Exile. Adapted from Raphael and Ludwig, 2003.[35]

Gun-related homicides (by far the majority of homicides) decreased after Project Exile went into effect. The policy was widely lauded as a success by the National Rifle Association, The New York Times, and President George W. Bush.

But the evidence for harsher sentences in Richmond is not nearly as strong as it is for earlier closing times in New South Wales. First, the data is very scarce. There are only three data points after the program was established, for 1997, 1998, and 1999. Further, the number of gun homicides actually increased dramatically for 1997, even though gun possession offenders were tried in federal courts beginning in February 1997. However, 1998 and 1999 do show solid declines, ending lower than anything in the previous decade.

Let’s table for a moment the question of chance; with only three data points, luck becomes a real concern. Suppose we believe the decline is real and permanent, and not just a fluke due to natural variation. We still have the problem of attributing cause to Project Exile and not something else. Really what we need is another identical Richmond to show us the alternate history where Project Exile never happened.

We don’t have another Richmond, but there are many other cities. If those cities are similar enough in the right ways, they might approximate the lost history where Richmond never had a Project Exile. Here’s the homicide rate data from other cities which are similar in various ways, but none of which implemented such a program.

Gun homicides per 100,000 residents in Richmond, Virginia, before and after Project Exile, compared to other cities. From Raphael and Ludwig, 2003. [36]

Virtually every city in the United States experienced a decline in gun violence in the late 1990s. In fact violent crime of all types decreased all through the country during the 1990s. No one really knows why, though there are many theories. [37] Evidently, you didn’t need to change sentencing guidelines for illegal gun possession to see a drop in gun crime in the late 1990s.

Maybe you can still say that Richmond had a larger decline. But Richmond also had more crime to begin with, and a big spike in 1997. Proportionally, as a percentage change, Richmond’s decrease was well in line with other cities. You can see this if you plot the data on a logarithmic scale.

Gun homicides per 100,000 residents in Richmond, Virginia, and other cities, on a logarithmic scale. From Raphael and Ludwig, 2003. [38]

Each vertical step on a logarithmic scale corresponds to an increase by a constant multiplier, which means we are comparing percentage change instead of absolute numbers. When we compare this way, Richmond doesn’t look particularly better than other types of cities. Most cities experienced a drop in gun violence of about the same percentage as Richmond, which appears on this chart as a decrease of about the same slope. This is evidence that doing nothing would have been just as effective.
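A small sketch may make the logarithmic comparison concrete. The rates below are invented for illustration and are not taken from Raphael and Ludwig, and the function names are my own. Two hypothetical cities that fall by the same percentage show very different absolute drops, but the same vertical step on a log axis.

```python
import math

def absolute_change(before, after):
    """Change in the raw rate, in homicides per 100,000."""
    return after - before

def percent_change(before, after):
    """Change relative to the starting rate, in percent."""
    return (after - before) / before * 100.0

def log_step(before, after):
    """Vertical distance between the two points on a log10 axis."""
    return math.log10(after) - math.log10(before)

# Hypothetical rates per 100,000 residents as (before, after); not real data.
high_crime_city = (40.0, 30.0)
low_crime_city = (8.0, 6.0)

for name, (before, after) in [("high", high_crime_city), ("low", low_crime_city)]:
    print(name,
          absolute_change(before, after),
          percent_change(before, after),
          round(log_step(before, after), 4))
```

On absolute numbers the high-crime city looks five times more successful; on percentage change or log steps the two are indistinguishable.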

Here you can have an argument about whether percentage change or absolute numbers are the right way to compare a drop in crime between cities. You can also try to construct more elaborate analyses showing that while murders in Richmond would have dropped anyway, Project Exile made them drop more. We’re far from the last word, but we’re also past a simple argument that Project Exile caused the observed fall.

And, of course, you can jump out of this framing entirely and ask if increased punishment is really the way that we, as a society, want to deal with a type of crime that primarily involves and affects already disadvantaged groups. As always, the data is never the full story.

Back to New South Wales: does the closing-time policy change suffer from the same sort of “would have happened anyway” problem? Again, the theoretically perfect test would require an identical copy of the city. But we do have data from the adjacent neighborhood of Hamilton, which did not see a restriction on closing times.

Number of assaults per quarter in the central business district (CBD) of New South Wales, where closing time was restricted to 3 a.m., and the neighboring region of Hamilton where it was not. From Kypri, Jones, McElduff and Barker, 2010. [39]

And sure enough, there was no apparent reduction in assaults in Hamilton. The main weakness of this sort of comparison is that Hamilton is not perfectly matched with the area where the closing time was changed. It has fewer bars and a far lower rate of assaults to begin with. Still, this comparative data provides a minimal sanity check. We need to exclude the possibility that something else happened around the same time that lowered assault rates generally. That’s what seems to have happened with homicides in American cities in the late 1990s. The other reason for looking at the data for the adjacent district is to make sure that crime was actually reduced, not just displaced to nearby areas.
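One rough way to use a comparison area is to subtract its before-and-after change from the change in the area where the policy applied. The quarterly counts below are invented, not the figures from Kypri and colleagues, and this subtraction is only an illustrative sketch of the idea, not the method the study used.

```python
def mean(values):
    return sum(values) / len(values)

def before_after_change(before, after):
    """Average count after the policy minus average count before."""
    return mean(after) - mean(before)

# Hypothetical assaults per quarter; not real data.
cbd_before = [100, 96, 104, 100]
cbd_after = [70, 66, 74, 70]
hamilton_before = [30, 28, 32, 30]
hamilton_after = [31, 29, 33, 31]

cbd_change = before_after_change(cbd_before, cbd_after)
hamilton_change = before_after_change(hamilton_before, hamilton_after)

# What is left after removing the "would have happened anyway" change.
net_change = cbd_change - hamilton_change
print(cbd_change, hamilton_change, net_change)
```

A rise in the comparison area, as in these made-up numbers, would also be a prompt to check whether assaults were displaced rather than prevented.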

Any claim of cause is implicitly a claim about data from a world we don’t ever get to see: a world where the cause never happened. It’s worth thinking about how to approximate this world through comparisons or modeling. Just looking for increases or decreases is not enough. As the Project Exile researchers put it:

One larger lesson from our analysis of Richmond’s Project Exile is the apparent tendency of the public to judge any criminal justice intervention implemented during a period of increasing crime as a failure, while judging those efforts launched during the peak or downside of a crime cycle as a success. [40]

And that’s just not right. The correct comparison is not “up or down,” but “what would have happened otherwise?” This applies just as well to the question of whether chicken soup cures colds as it does to the question of whether harsher sentences deter crime.

Causal Models

Cause cannot usually be read directly from the data, no matter how much we might wish this were the case. Consider this graph of mortality versus smoking rate across different occupations:

Normalized mortality rate versus smoking rate for different professions in the United Kingdom, 1970–1972. [41]

There is a clear association between smoking and mortality: a correlation. It seems natural to say that this is evidence that smoking contributes to an early death. But how about this chart:

Correlation between countries’ annual per capita chocolate consumption and the number of Nobel Prize winners. From Messerli. [42]

If the previous chart shows that smoking causes premature death, then this chart shows that eating chocolate makes you more likely to win a Nobel Prize. No? But then why do we believe the first correlation is causal, while this one isn’t? There must be some other factor here; our reasoning must be including something other than just the data.

Here’s a more ambiguous case:

U.S. quarterly unemployment rate versus investment-to-GDP ratio from 1990 to 2010, plotted by John Taylor. [43] [xx]

How would you describe this graph? Maybe: When investment goes up, unemployment goes down. But saying it that way makes it sound like increasing investment would cause unemployment to drop, and that’s not necessarily true. We might as well say that when unemployment goes down, investment goes up, implying a cause in the other direction. Perhaps we could say: Investment and unemployment move together, in opposite directions. That’s all we actually know from this data, yet it feels unnatural to write about an association between two variables while saying nothing about the causal relationship between them. We are wired to see causes.

The difference in our intuitions about these three charts has to do with whether or not we know a story that explains how the cause relates to the effect. You can probably imagine how investment would lead to employment, or perhaps how employment would lead to investment. You’ve also probably heard that smoking causes cancer. But there’s no obvious story that links eating chocolate and winning a Nobel Prize.

We are dealing with a correlation here, a pattern in two variables such that when one changes the other changes as well. There are various mathematical definitions of a correlation, but for our purposes the most straightforward conception is fine. Scatterplots are a popular way to compare two variables, but anything which shows two variables can reveal a correlation. One of those variables might implicitly be the time of an event, as in our crime examples where we were looking at the correlation between a change in policy and the number of assaults or murders. Here’s another type of correlation, from an analysis of men writing a first message to women on the dating site OKCupid: [44]

This data seems to show that including the word “awesome” in a first message will cause an above-average reply rate, while including the word “sexy” will cause a much lower chance of a response. But that’s not what the data actually says. That’s just a story that leaps to mind. It’s easy to imagine why women would ignore a creepy first message from a stranger who called them “sexy.”

As usual, our stories about the data may or may not reflect reality, and the principal method of testing our stories is trying to imagine how else the data might have come to be. Fortunately, there are not that many ways two variables can become correlated.

These little graphs are causal models. Like all statistical models, they are not reality but a way of talking and thinking about reality. Each circle is a variable, something that is or could be quantified. Each little arrow means “causes.” What exactly a “cause” is has been debated since Aristotle, but in this framework it is defined in terms of possible interventions: X causes Y means that there is some specific thing you could do in the world to force the variable X to take a specific value, and if you did that the outcome of Y would change in a probabilistic sense.

These causes are not definite. To say that smoking causes cancer means that if you could force someone to smoke, they would be more likely to get cancer. Not that they will get cancer, but that it increases the probability. The arrows in these diagrams are fuzzy, probabilistic cause. Instead of “causes,” think “changes the distribution of.”

This level of abstraction lets us talk about cause in a very general way. Every correlation of any two variables is the result of one of these causal patterns, or more likely a combination of them. Usually, the data alone cannot tell you which pattern produced your correlation. [xxi]

For example, X causes Y and Y causes X appear the same in the data. We have to use other information to figure out the correct causal structure.

There could be no causal relationship at all, just random coincidence between X and Y. As we’ve seen, coincidence can be quantified by estimating the probability that chance generated your data. In the OKCupid case we could ask: How often does a randomly chosen word have an above- or below-average response rate as large as these words? If we plot the response rates of lots of words, we may find that these particular words are not special at all; this chart could just show some particularly entertaining words that have quite ordinary fluctuations in response rate. If you can cherry-pick the evidence, you can prove whatever you want.
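To get a feel for how large “ordinary fluctuations” can be, here is a small simulation under invented assumptions: every word has exactly the same true reply rate, and each word appears in only a few dozen messages. None of these numbers come from OKCupid, and the function name is hypothetical.

```python
import random

def simulate_word_rates(n_words, messages_per_word, true_rate, seed=0):
    """Observed reply rate per word when no word has any real effect."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_words):
        replies = sum(rng.random() < true_rate for _ in range(messages_per_word))
        rates.append(replies / messages_per_word)
    return rates

# Assumed values: 1,000 words, 50 messages each, 30 percent true reply rate.
rates = simulate_word_rates(n_words=1000, messages_per_word=50, true_rate=0.3)
lowest, highest = min(rates), max(rates)
print(lowest, highest)
```

Picking only the words at the extremes of this list would produce a striking chart even though, by construction, word choice does nothing here.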

It can also be that Y causes X, but not in this case. The reply cannot cause the initial message because causes have to come before their effects. In other cases the causality could flow in the other direction, or the variables could affect each other in a feedback loop. High unemployment might be both a cause and effect of low investment. If cities with more guns are associated with higher crime, it could be that access to weapons causes crime, or it could be that living in a dangerous place makes people want to buy a gun. Or the association could have happened purely by chance.

In reality, it’s probably some combination of all of these effects. The data you have is the result of people using the guns they have and people buying guns because of the high crime rate and a whole range of chance factors.

It could also be the case that some other factor Z causes both X and Y. For example, there could be something that causes a man to write about a woman’s appearance and causes a woman to reply less often. This is the possibility most often neglected in casual data analyses, but there could be any number of factors that would influence both language use and response rate.

Like attractiveness. Perhaps attractive women get a lot more messages than average (too many to want to reply to all of them), so their overall response rate is lower. If we believe that “attractiveness” is a real and coherent notion that could be usefully measured in some way, perhaps by asking many people to rate a photograph, then it is reasonable to talk about it as a variable. This leaves us with two plausible hypotheses.

There is no way to tell these two hypotheses apart from the data above, because both would produce the same correlations.

The third variable in this three-way structure is called a confounder, and confounding variables appear frequently in real-world analyses. The key is to look for another variable that causes both of the variables you see as related. For example, overall economic growth could both reduce unemployment and increase investment. A rich country might both import a lot of chocolate (a luxury good) and fund advanced research. The reduction in crime rates after the bars’ closing time changed could be because the police began patrolling to enforce the earlier closing time.
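A quick simulation shows how a confounder can manufacture a correlation. In this sketch, which uses invented variables and noise levels, Z drives both X and Y, and X has no effect on Y at all. The `pearson` helper is a plain textbook correlation coefficient written out by hand.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)

# Z is the hypothetical confounder; X and Y each equal Z plus their own noise.
z = [rng.gauss(0, 1) for _ in range(5000)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson(x, y)
print(round(r_xy, 2))
```

X and Y come out clearly correlated even though neither one appears in the formula for the other.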

But then again, a stressful profession could both make you smoke and reduce your lifespan. The tobacco industry has attacked the association between smoking and disease for decades on precisely this basis of possible confounding variables (and many other arguments [45]). In the mid-1960s, one statistician received tobacco industry funding “to seek to reduce the correlation of smoking and diseases by introduction of additional variables.” [46] As repugnant as this might be, we have to take seriously the logical possibility of a spurious correlation. Ultimately, the proof of smoking’s harm also relies on other types of non-correlational evidence such as animal experiments. We can tell a story about smoke causing cancer that we can confirm in the lab.

Confounding variables are common in practice. Coffee might cause cancer, but then again maybe a certain type of person both smokes and drinks coffee. [47] Poor sleep might cause poor grades in school, or poverty might cause both. [48] The confounding circumstance may not be measured in the data you have and may not even be something that can be measured directly. You can only find a confounder by thinking about the broader context of the data.

Once you have found a confounding variable, it may be possible to subtract off its effect, a process that is called controlling for a variable. For example, you could investigate the relationship between smoking and cancer while controlling for the stress of different professions. This only works if your causal model is otherwise accurate. Again, it’s a way to ask about a counterfactual: What would be the relationship between employment and investment if growth didn’t drive both of them? Or how much would women make if they worked the same number of hours as men? Reasoning about imaginary worlds is always tricky.
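One simple form of controlling for a variable is stratification: look at the relationship separately within groups that share the same value of the confounder. The sketch below uses an invented binary confounder and assumes the causal model really is “Z causes both X and Y,” which is exactly the assumption that may fail in practice.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(3)
rows = []
for _ in range(6000):
    z = rng.random() < 0.5                      # hypothetical binary confounder
    x = (1.0 if z else 0.0) + rng.gauss(0, 0.5)  # X depends only on Z
    y = (1.0 if z else 0.0) + rng.gauss(0, 0.5)  # Y depends only on Z
    rows.append((z, x, y))

# Correlation ignoring Z.
overall = pearson([x for _, x, _ in rows], [y for _, _, y in rows])

# Correlation within each level of Z, that is, controlling for Z.
within = {}
for level in (False, True):
    group = [(x, y) for z, x, y in rows if z == level]
    within[level] = pearson([x for x, _ in group], [y for _, y in group])

print(round(overall, 2), {k: round(v, 2) for k, v in within.items()})
```

The overall correlation is substantial, while within each stratum it is close to zero, which is what “subtracting off” the confounder looks like in this toy case.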

I’ve used pictures informally to talk about causal structures, but they’re actually part of a well-founded mathematical theory of cause developed in the late twentieth century by Judea Pearl and others. [49] These pictures are called graphical models, not because they are graphics but because they are graphs in the mathematical sense of nodes and edges. You can use them to describe much more complex causal structures with more variables, like this model from one of my favorite statistics books:

From Kaplan. [50]

In this invented network we have data for the pink variables but not the gray variable. In general there will be many intervening factors you can’t measure, as well as unknown causes that you may never have thought of. You just don’t know the correct causal structure of the world, but at least you can draw little pictures of the possibilities you can imagine.

The best way to figure out causation is to do an experiment. After all, causation is defined in terms of interventions, and an experiment is all about intervening. In the online dating case, we could take many men and randomly tell each one to include or exclude certain words in their first message to a woman, then tally the response rate for each word. This is different from the data we already have in a crucial way. In this experiment the men do not decide which words to use (we have intervened!). They cannot base their decision on the woman’s appearance, or for that matter anything about themselves or the woman to whom they are writing. This removes the effect of many potential confounding variables in one shot.

This type of experiment is a generalization of the idea of comparing cases. We repeat a particular scenario many times with and without the hypothetical cause and see if the effect appears more often when the cause is present. John Stuart Mill wrote about this “method of difference” in his 1843 A System of Logic:

If an instance in which the phenomenon under investigation occurs, and an instance in which it does not occur, have every circumstance save one in common, that one occurring only in the former; the circumstance in which alone the two instances differ, is the effect, or cause, or a necessary part of the cause, of the phenomenon. [51]

Mill understood that it would not always be possible to distinguish “X causes Y” from “Y causes X” from data alone (“is the effect, or cause”). Experiments are one way out, because we set the value of X and watch what happens to Y. The hitch is that we don’t know what would have happened to Y if we didn’t set X. How many non-smokers would have developed lung cancer anyway? This is why modern experiments use a control group for comparison. To ensure that the two groups are otherwise identical (“every circumstance save one in common”), we can randomly assign people between them. This basic design was formalized at the end of the nineteenth century and is known as a randomized controlled experiment.
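The value of random assignment can be seen in a toy simulation of the dating example. Everything here is assumed for illustration: attractiveness is a yes/no variable, the word has no true effect on replies, and the probabilities are made up. The only thing that differs between the two runs is who decides whether the word is used.

```python
import random

def reply_rate_gap(randomize, n=20000, seed=2):
    """Reply rate with the word minus reply rate without it."""
    rng = random.Random(seed)
    replies = {True: [], False: []}
    for _ in range(n):
        attractive = rng.random() < 0.5
        if randomize:
            # Experiment: a coin flip decides, so we have intervened.
            uses_word = rng.random() < 0.5
        else:
            # Observation: men choose, influenced by the woman's appearance.
            uses_word = rng.random() < (0.8 if attractive else 0.2)
        # Assumed: replies depend only on attractiveness, never on the word.
        p_reply = 0.2 if attractive else 0.4
        replies[uses_word].append(rng.random() < p_reply)
    with_word = sum(replies[True]) / len(replies[True])
    without_word = sum(replies[False]) / len(replies[False])
    return with_word - without_word

observational_gap = reply_rate_gap(randomize=False)
experimental_gap = reply_rate_gap(randomize=True)
print(round(observational_gap, 3), round(experimental_gap, 3))
```

The observational data shows a sizable gap in reply rates that vanishes once assignment is random, because the coin flip cannot be influenced by the confounder.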

But again, journalists don’t normally get to do experiments. Sometimes we can evaluate other people’s experiments, but usually we are reduced to dealing with observational data. This makes cause an especially tricky subject. Causal models (our little arrow diagrams) are a way of expressing the possible causal relationships between variables. This can clarify our thinking and hopefully lead to ideas about how to test our stories against reality.

Truth by Elimination

In 2011 the Associated Press revealed that the New York Police Department had been closely monitoring 53 New York City mosques with methods including informants and video surveillance. [52] In 2012, the NYPD released a massive database of hundreds of thousands of stop-and-frisk incidents, where cops stopped people on the street, without cause, to check for weapons and drugs. A journalist analyzed this data and found that there was a 15 percent above-average number of stop-and-frisks within 100 meters of certain New York City mosques. [xxii]

A small portion of the NYPD’s stop-and-frisk data.

This might mean that the NYPD is deliberately targeting Muslims on the street. But there are many other ways this data could have come to be. Let’s list some possibilities:

Police are deliberately stopping Muslims near mosques.

It’s sheer chance.

Mosques could be in more heavily populated areas.

Patrol times might coincide with prayer times, for whatever reason.

There might be more police assigned to the area due to higher crime rates.

The data might be in error.

You could misunderstand how the data is collected.

This is the central problem of data analysis: The data alone cannot tell us that a story is true, because there could be many other stories that produce the same data. In principle all scientific analysis is a two-step process: Invent a number of hypotheses, then pick the one which is best supported by evidence. In journalism work, a narrative extracted from the data, “the story,” is morally equivalent to a hypothesis.

Actually, neither scientists nor journalists really work like this. Many people have pointed out that the interplay between inventing and testing ideas is much more complex than this little sketch. [53] In real work you go back and forth, refining ideas, gathering more information, finally getting your interview with a crucial source, testing theories, catching up on other people’s work, stumbling into flashes of creativity, drinking a lot of coffee, arguing with critics, going back to the drawing board, changing your mind, grinding forward. We should not consider this idea of creating and then testing hypotheses to be a literal description of our truth-finding process. Instead it describes a type of argument. It captures the core logic of why we should believe something is true, not necessarily the steps that actually led us to believe it.

Coming up with reasonable stories/hypotheses is a creative process that has to draw on specific background knowledge. Peirce called this hypothesis-generation process abduction and noticed that it followed certain rules: Your stories must explain the data, and they must not contradict known facts. Other than that, the possibilities are wide open. But there are a number of things that need to be checked in almost any story. Your list of hypotheses should include definitional problems, quantification troubles, errors in the data, random chance, and as many confounding variables as you can think of. The basic rule is this: you have to imagine it before you can prove that it’s true.

Is NYPD targeting of Muslims producing our data? The truth may be any of the possibilities above, some combination, or something that’s not even on the list.

If you have well-quantified variables and good models, there are statistical solutions to the problem of choosing between competing hypotheses. Much of the statistical work of the last hundred years has been devoted to just this sort of hypothesis testing, as we saw in the section on inference. These are powerful tools, but most problems in journalism do not have neatly quantified evidence. I don’t know how to express all of the above stop-and-frisk hypotheses in the same symbolic language, nor how to make reasonable probability estimates for each possibility. What’s the chance you’ve misunderstood the data format? In practice the solution is to double-check the format, rather than trying to compute a probability of error.

There are exceptions, highly structured cases where the full power of statistical hypothesis testing can be applied, such as election predictions. Even then, be wary: Have you included all the different ways the election could be rigged? The world will always find ways to surprise a model.

Ultimately there is no language more powerful than human language, and no reasoning more powerful than general human reasoning. That doesn’t mean looking at the data and intuiting the answer. There are many methods between intuition and statistics.

Good data analysis is more about ruling out many false interpretations than about trying to prove a single interpretation is correct. This may seem disappointing (can there be no certainty?), yet this idea is one of the great innovations in philosophy of science. It was best articulated by Karl Popper in the 1930s. His central idea was that falsification is a much more robust practice than verification.

There are many reasons why proving a story wrong is a better goal than proving a story right. If you only ever look for evidence that confirms your story, you may only ever find the evidence that confirms your story. Disconfirmation is also more powerful than confirmation in the sense that additional confirming evidence doesn’t really make a confirmed story more true, but once a story is contradicted by a single solid fact no amount of further evidence can rescue it. And we know, starting with a series of landmark cognitive psychology experiments in the 1970s, that there are biases in human cognition that lead us to reject, discredit, and selectively forget information that doesn’t fit with what we already believe. [54]

It’s useful to inquire against your hopes. Your critics certainly will.

Also, falsification is a way of clarifying the practical content of a hypothesis. Is there some way, at least in principle, that your hypothesis could be proved wrong? If a hypothesis says anything about the world, it should be possible to go check if the world really is that way. I don’t mean anything cosmic by this. “The police shift change happens during evening prayers” is a perfectly good hypothesis that could be tested by, say, getting a copy of the precinct schedule.

Carl Sagan throws down. [xxiii]

The idea of generating competing hypotheses and then disproving them appears in many forms, in many places. Aristotle wrote about the idea of different possible causes for the same event. Peirce certainly understood the principle in 1868 when he used his signature model to rule out chance as an explanation. Sir Arthur Conan Doyle had Sherlock Holmes talk about finding truth by testing alternatives in 1926, in the quote that opens this chapter. A 1980s CIA textbook on intelligence analysis contains a particularly readable description of a practical method, neatly tied to the theory of cognitive biases. [55]

In short, the method is this: At the beginning of the data analysis work, dream up all sorts of possible interpretations, all sorts of possible stories. The available data will rule some of them out, either obviously so or through statistical testing. The stories which survive that test are the ones you have to choose between. To do that, you will need more information. The remaining set of hypotheses will tell you which information you need to rule each of them out, whether that’s another data set or a conversation with a knowledgeable source.

Each of the stop-and-frisk hypotheses suggests a different investigative technique. We can examine the effects of chance statistically, perhaps by counting the number of stops within 100-meter radius circles placed randomly throughout the data, not centered on mosques at all. But pretty much every other hypothesis has to be tested against information that isn’t in the stop-and-frisk data. We might want to add other data to the analysis; for example, we could correlate mosque locations with population density. Or we might need to have a conversation with a cop who can explain how police patrols are assigned. The goal here isn’t to prove any particular hypothesis but to test each of them by finding evidence against them.
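The random-circles check for chance could be sketched roughly as follows. This is a toy version under loud assumptions: the stop locations are invented and scattered uniformly over a hypothetical square area measured in meters, and the “site” is an arbitrary point. Real stop-and-frisk data would use geographic coordinates and would be heavily clustered.

```python
import math
import random

def count_within(stops, center, radius):
    """Number of stops no farther than radius from center."""
    cx, cy = center
    return sum(1 for x, y in stops if math.hypot(x - cx, y - cy) <= radius)

rng = random.Random(4)
EXTENT = 2000.0   # side of the hypothetical square study area, in meters
RADIUS = 100.0    # circle radius, as in the analysis described above

# Invented stop locations; not real data.
stops = [(rng.uniform(0, EXTENT), rng.uniform(0, EXTENT)) for _ in range(2000)]

# Counts around randomly placed centers form the chance baseline.
baseline = [
    count_within(stops, (rng.uniform(0, EXTENT), rng.uniform(0, EXTENT)), RADIUS)
    for _ in range(200)
]

# Count around one hypothetical site of interest.
observed = count_within(stops, (1000.0, 1000.0), RADIUS)

# Share of random circles that contain at least as many stops as the site.
share_at_least = sum(1 for c in baseline if c >= observed) / len(baseline)
print(observed, share_at_least)
```

If a large share of random circles hold as many stops as the circle around the site, chance alone is a live explanation. Circles near the edge of the square cover less of the study area, which this sketch ignores.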

We’re looking for information which falsifies one of our hypotheses. Reality may not be so cooperative. The next best thing is information which prefers one hypothesis to another: not falsifying evidence but differential evidence. We might also find that a combination of hypotheses fits best: The NYPD might be intentionally stopping Muslims on the street and mosques might be in more densely populated areas. That itself is a new hypothesis.

The method of competing hypotheses need not involve data at all. You can apply the idea of ruling out hypotheses to any type of reporting work, using any combination of data and non-data sources. The concept of triangulation in the social sciences captures the idea that a true hypothesis should be supported by many different kinds of evidence, including qualitative evidence and theoretical arguments. That too is a classic idea. Here’s Peirce again:

Philosophy ought to imitate the successful sciences in its methods, so far as to proceed only from tangible premises which can be subjected to careful scrutiny, and to trust rather to the multitude and variety of its arguments than to the conclusiveness of any one. Its reasoning should not form a chain which is no stronger than its weakest link, but a cable whose fibers may be ever so slender, provided they are sufficiently numerous and intimately connected. [56]

What you see in the data cannot contradict what you see in the street, so you always need to look in the street. The conclusions from your data work should be supported by non-data work, just as you would not want to rely on a single source in any journalism work.

The story you run is the story that survives your best attempts to discredit it.

Communication

The mark of a civilized human is the ability to look at a column of numbers, and weep.
- attributed to Bertrand Russell [xxiv]

Quantification produces data and analysis brings meaning to it. But it doesn’t count as journalism unless you can communicate what you’ve learned. This need shapes the story all the way through, including quantification and analysis.

In journalism we usually need to assume that the audience has little familiarity with either the subject of the story or quantitative concepts in general, which makes this particularly difficult.

And after reading, the reader [xxv] information, or our journalism has no effect. This ties journalism to prediction.

Most people are not used to interpreting data, and it’s hard to blame them. Data visualization can be helpful because it transfers some of the cognitive work of understanding data to the enormously powerful human visual system. Still, the foundational concepts of data work are subtle and at times unnatural. The nuances of sampling, probabilities, causality, and so on are foreign to everyday experience. More than that, numbers are not a particularly empathetic medium. For most people even the most screaming statistic is disconnected from everyday experience. Journalists can overcome this using examples, metaphors, or stories to relate the numbers to people. Journalism is a deeply human task, no matter the methods.

Ultimately, a journalist is responsible for the ideas that end up in their reader’s head. There are two parts to this: ensuring that the data and the story clearly and accurately represent the reality, and ensuring that this accurate representation is what the reader actually comes away with.

Perception

Quick, which of these shapes is different?

Well that was easy. How about now?

Now try this one. Which shape is different from all others here?

The first two were easy, but that one was slightly harder, right? These examples illustrate a visual ability called the pop-out effect, which lets you find something in a sea of similar objects without having to think about it. The object that is different just “pops out” at you. Except that sometimes it works better than others. You probably took a few seconds longer to find the single vertical light bar in the last image.

Pop-out sometimes works and sometimes doesn’t because you have “hardware” in your visual system that can perform complex processing tasks below the level of your consciousness. Under the right circumstances, color, orientation, shape, texture, motion, depth, flicker, and many other visual attributes can cause pop-out. But if the problem gets too complex for your highly specialized visual hardware, you have no choice but to do a “visual search” by scanning each object, like a Where’s Waldo book.

Your visual system can do all sorts of other neat tricks, like comparisons.

You don’t have to think to know which object is largest, or tilted down the most, or whether the circles are different colors. This is the basis of all data visualization: We are relying on very rapid, unconscious abilities of the human visual system to communicate data quickly. With a well-designed visualization, you don’t need to think about it to see a trend or a cluster.

Data visualization researchers have identified many important features of the human eyes and brain. [57] There are different visual “channels” we might use to encode data, such as position, size, color, orientation, shape, texture, motion, depth, and a dozen more, and from experiments we know the effectiveness of these channels for different types of representation. For example, we know that position is the fastest and most accurate visual channel for comparing quantities, while color works great for categorical data but poorly for continuous variables. We’ve measured how perceived contrast changes depending on context, and explored how noise and clutter can slow down visual tasks. And we’ve teased out how pictures save on short-term memory. With a picture in front of you, you don’t need to store the relationships between elements in your working memory, because you can just look and see. This frees up your thinking for more sophisticated thoughts about the content.

Our visual processing system is so fast and sophisticated that maybe we shouldn’t think about it as cognition at all. Instead, it’s perception. It feels like you “just see” the important features of the visualization. But of course we don’t “just see.” Experimenters have mapped out exactly what we do and don’t see, and you can train your eye over time, too, like when you learned to recognize letters and then words.

Considering our visual abilities leads to important design choices. Our unconscious ability to compare lengths is why you should generally start the Y axis at zero. Otherwise, the relative lengths won’t correspond to the relative values, and we’ll perceive incorrect relationships. Ignoring visual perception when creating data visualizations is like ignoring the consensus meanings of words when writing.
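The arithmetic behind the zero-baseline advice is short enough to write down. In this sketch the two values and the truncated axis start are made up, and `drawn_ratio` is a hypothetical helper that reports how much longer one bar is drawn than the other.

```python
def drawn_ratio(value_a, value_b, axis_start):
    """Ratio of bar lengths (b to a) when the axis begins at axis_start."""
    return (value_b - axis_start) / (value_a - axis_start)

# Hypothetical values: b is 10 percent larger than a.
value_a, value_b = 50.0, 55.0

honest = drawn_ratio(value_a, value_b, axis_start=0.0)      # bars differ by 10 percent
truncated = drawn_ratio(value_a, value_b, axis_start=45.0)  # second bar drawn twice as long

print(honest, truncated)
```

With the axis starting at 45, a 10 percent difference is drawn as a bar twice as long, which is the incorrect relationship the eye will perceive.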

But it’s not just vision we need to understand. We can’t possibly study the communication of data without studying the human perception of quantities. How our story is perceived depends on everything from vision to cognition to what the audience already believes.

Representation

Most of what we know comes through some form of media, some form of secondhand representation. A great deal has been said on who and what gets represented in journalism, and how certain people and ideas are presented. Adding data does not change the basic nature of these issues, but data is a different kind of information that lends itself to different kinds of communication.

I tend to think of information as coming in two different flavors: examples and statistics. The story of someone looking for a job is an example, while the unemployment rate is a statistic. People also talk about anecdotes versus data, or case studies versus surveys, or narratives versus numbers, or maybe qualitative and quantitative. Not all of these pairs are talking about quite the same thing, but they all capture some kind of difference. I don’t think these modes of information are in opposition, or even that the boundary is really all that clear. (What would you call the ethnographies of a randomly sampled set of people?) But I do see two very general patterns in the way information can be collected.

You can collect a small amount of specific information from many people and summarize it with statistics. Or you can collect rich, open-ended information from just a few people and present each as an in-depth example. In this sense statistics and examples are complementary forms, and both can be used to represent a broader group of people. That is, both can be used to infer information we did not collect: additional details about the lives of more people. All representation is generalization.

Consider unemployment again. A survey asks a few questions of many people, so that we can count how many people are unemployed. We can also find patterns of connection between employment status and location, education, age, and so on. To see these patterns truly, without bias, we must either count every single person or take a random sample. That is, a random sample is a representative sample. But we also need to understand the lives of individual people, or we cannot ever understand how these societal forces play out in practice. Maybe we know that people of a certain race have higher unemployment, but how does this actually happen? What goes on in such a person’s life when they are looking for a job? What did they hear in their last interview? The unemployment rate cannot answer these sorts of questions, but the stories of individual people can.
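The claim that a random sample is a representative sample can be checked on a made-up population. In the sketch below, the two regions, their unemployment rates, and the sample sizes are all invented; the point is only to compare a random sample with a convenience sample drawn from one region.

```python
import random

rng = random.Random(5)

# Hypothetical population: two regions with different unemployment rates.
population = []
for _ in range(50000):
    population.append(("A", rng.random() < 0.20))
for _ in range(50000):
    population.append(("B", rng.random() < 0.05))

def unemployment_rate(people):
    """Share of people flagged as unemployed."""
    return sum(1 for _, unemployed in people if unemployed) / len(people)

true_rate = unemployment_rate(population)

# A random sample: every person is equally likely to be chosen.
random_sample = rng.sample(population, 2000)

# A convenience sample: only people from region B were reachable.
region_b_only = [person for person in population if person[0] == "B"]
convenience_sample = rng.sample(region_b_only, 2000)

random_estimate = unemployment_rate(random_sample)
convenience_estimate = unemployment_rate(convenience_sample)
print(round(true_rate, 3), round(random_estimate, 3), round(convenience_estimate, 3))
```

The random sample lands near the true rate, while the convenience sample is far off no matter how many people it includes, because its error comes from who was asked rather than from how many.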

In the best case, a story combines numbers and narratives. The data represents many people in a narrow but meaningful way, while stories relate the deep experiences of only a few, and these different types of information together describe a unified reality. But this is only what’s on the page.

Examples Trump Statistics

Taking responsibility for the impression that the reader comes away with requires an understanding of how people integrate different types of information. And generally, examples are much more persuasive than statistics, even when they shouldn’t be.

The United States has seen a two-decade-long decline in violent crime rates. This holds across every type of violent crime and in every place.

Over the same period of time, there has been a very widespread perception that crime is getting worse.[58]

The number of people who believe that crime is worse this year than last has hovered around 60–80 percent for decades, even as the number of people who have been the victim of a violent crime has fallen by a factor of three. Gallup goes so far as to say “perceptions of crime are still detached from reality… federal crime statistics have not been highly relevant to the public’s crime perceptions in recent years.”[59]

How can this be? There is a wealth of data on crime in the United States, most of it freely available, and crime rate figures have been repeated endlessly in news stories. Surely this is an easily correctable misperception. (And it’s definitely a misperception. Although there are all sorts of issues in counting crime, violent crime rates are thought to be the most accurate type of crime data because the seriousness of incidents like homicide makes them harder to hide and easier to count.)

I don’t know for certain why perception is so far from reality in this case (I don’t think anyone really does), but the pattern fits what we’ve seen in experiments.

It was not until the 1970s that researchers investigated the human perception of statistical information in a serious way. Near the end of that decade, Hamill, Wilson, and Nisbett asked a simple question: How does statistical information change the perception of an anecdote?[60]

These researchers wanted to see if people would discount an extreme example when they were given statistics that showed it to be extreme. So they showed over a hundred people a New Yorker article about a welfare recipient:

The article provided a detailed description of the history and current life situation of a 43-year-old, obese, friendly, irresponsible, ne’er-do-well woman who had lived in New York City for 16 years, the last 13 of which had been spent on welfare. The woman had emigrated from Puerto Rico after a brief, unhappy teenage marriage that produced three children. Her life in New York was an endless succession of common-law husbands, children at roughly 18-month intervals, and dependence on welfare. She and her family lived from day to day, eating high-priced cuts of meat and playing the numbers on the days immediately after the welfare check arrived, and eating beans and borrowing money on the days preceding its arrival. Her dwelling was a decaying, malodorous apartment overrun with cockroaches…[61]

This was a real person, but she was not a typical case, because almost no one stays on welfare for 13 years. One group of readers also saw statistical information showing this was so:

Statistics from the New York State Department of Welfare show that the average length of time on welfare for recipients between the ages of 40 and 55 is 2 years. Furthermore, 90 percent of these people are off the welfare rolls by the end of 4 years.[62]

The other group of readers was given false statistical information that made 13 years seem like a normal length of time:

Statistics from the New York State Department of Welfare show that the average length of time on welfare for recipients between the ages of 40 and 55 is 15 years. Furthermore, 90 percent of these people are off the welfare rolls by the end of 8 years.[63]

Then everyone was given a brief quiz with questions about their perception of welfare recipients such as:

How hard do people on welfare work to improve their situations? (1 = not at all hard, 5 = extremely hard)[64]

As you might expect, most people came away from all of this with a rather negative impression of people on welfare, much more negative than a control group who did not read the story. But there was no meaningful difference in the opinions of those who read the real versus fake statistics, and no difference when the statistics were presented before versus after the story.

The description of the woman in her shabby apartment is so vivid, so real, so easy to connect to our own experiences and cultural stereotypes. It completely overwhelms the data. It’s not that people didn’t remember the average length of time someone stays on welfare; they were quizzed on that, too. The statistical information simply didn’t figure into the way they formed their impressions.

I certainly don’t blame readers for this; it’s never worthwhile to blame your readers. Nor am I convinced I would be any different. I don’t think it was clear enough that this woman was atypical; vivid examples are persuasive, and readers had no reason to be especially careful. Rather than shaking my faith in the intelligence of humanity, I just see this as a lesson in how to communicate better.

There have been other experiments in a similar vein, and they usually show that examples trump statistics when it comes to communication. In one study people were asked to imagine they were living with chest pain from angina and had to choose between two possible cures. They were told that the cure rate for balloon angioplasty was 50 percent and the cure rate for bypass surgery was 75 percent. They also read stories about people who underwent different surgeries. In some cases the surgery succeeded in curing their angina and in some it failed, but these examples contained no information that would be of use in choosing between the surgeries. Even so, people chose bypass surgery twice as often when the anecdotes favored it, completely ignoring the stated odds of a cure.[65]

Which brings us back to crime reporting. In major cities, not every murder makes the news. In different times and places the share of murders reported in the news has varied between 30 percent and 70 percent of the total.[66] The crimes that get reported are always the most serious. Content analysis has shown that coverage is biased toward victims who are young, female, white, and famous, as well as crimes which are particularly gruesome or sexual. Yet these examples are the stuff from which our perceptions are formed. It’s enough to make a media researcher weep:

Collectively, the findings indicate that news reporting follows the law of opposites: the characteristics of crimes, criminals, and victims represented in the media are in most respects the polar opposite of the pattern suggested by official crime statistics.[67]

Not only is crime reporting biased in a statistical sense, but the psychological dominance of examples means that readers end up believing almost the opposite of the truth. This is a type of media bias that is seldom discussed or criticized.

If you want the reader to walk away with a fair and representative idea of what the data means out in the world, then your examples should be average. They should be typical. This goes up against journalism’s fascination with outliers. It’s said that “man bites dog” is news, but “dog bites man” is not. But if we want to communicate what the bite data says we should consider going with “dog bites man” for our illustrative examples.

My favorite stories draw on both statistics and examples, using complementary types of information to build up a full and convincing picture. But generally, examples are more persuasive than statistics presented as numbers. Individual cases are much more relatable, detailed, and vivid, and they will shape perception. The bad news is that poorly chosen examples can create or reinforce bad stereotypes. But this also means that well-chosen examples bring clarity, accuracy, and life to a story, as every storyteller knows.

Who Is in the Data?

Data about people affects people’s lives. Urban planners, entrepreneurs, social critics, police: all kinds of people use data-based representations of society in their work. This is why the issue of representation is so important. Changing how someone is perceived, or if they are perceived at all, can have enormous effects.

The “goodness” of a representation depends on what you want to do with it (the story you are telling), but in many cases it seems most fair to count each person equally. There is a nice alignment here between democracy and statistics, because the simplest way to generate data is to count each item in exactly the same way. Random samples are also very popular, but they are just a practical method to approximate this ideal. This moral-mathematical argument on the representativeness of data is almost never spelled out, but it’s so deep in the way we think about data that we usually just say data is “representative” of some group of people when it approximates a simple count.

The data you have may deviate from this ideal in important ways.

Journalists have been trying to portray the public to itself for a long time. When you read an article about student debt that quotes a few students, these students are standing in for all students. Broadcast journalism’s “person on the street” interview brings the reader into the story by presenting the opinions of people who are “just like them.” Of course, it never really works out that way; reporters only interview a small number of not-really-random people, and television crews tend to film whoever is easiest to get on camera.

When Osama bin Laden was killed in 2011, the Associated Press undertook a project to gather reactions from all over the world. Reporters rushed to pick up any camera they had and ask the same scripted question of many people. But which people? In practice it will depend on factors like which reporters are most keen on the project, who the reporters already know, who is easiest to get to, and who is most likely to speak a language the reporter understands. The project was meant to capture the global response to a historic event, but it’s not clear whose voices are actually represented. A global, random video sample on a breaking news deadline would be quite a challenge, but perhaps you could try to get a certain range of country, age, race, gender, and so on.

Social media seems to offer a way out, because it represents so many more people. No doubt bulk social media analysis can be a huge improvement over a handful of awkwardly chosen sources. But social media isn’t really representative either, not in the sense that a random sample is.

Here’s New York City, as revealed by geocoded tweets:[68]

I find this map beautiful and revealing. It’s not a map of geography or political boundaries, but a map of people. I love how it traces major transit routes, for example. But it is only a map of certain types of people, as I know from comparing it to a population-density map. There are large sparse areas in Brooklyn where plenty of people live, and Soho is definitely not as dense as Midtown. Also, only a few percent of tweets are geocoded. What sort of person uses this feature?

Not everyone is on Twitter, not everyone is tweeting, and even fewer are speaking on the topic of your story. This data has a bias toward certain types of people, and you don’t really know which kind of people those are. There is surely useful information to be got from social media, but it is not the same kind of information you can get from a random sample.

Whether or not this is a problem depends on your story. Twitter users tend to be affluent and urban, so if that’s the population you want to hear from, you’re good. If it’s not, there may not be much to say from a Twitter analysis. Any representation of public sentiment created from social media data (a word cloud or anything else) will be biased in an unknown way. That is, the results will be skewed relative to a random sample, and the worst part is you won’t know how skewed they are.

The way you choose your data can also create representativeness issues. Here’s a visualization by Moritz Stefaner that is meant to show the “Vizosphere,” the people who make up the data visualization community.

Excerpt from the Vizosphere by Stefaner.[69]

Of course it’s not really a visualization of everyone involved with visualization. To create this picture, Stefaner started with “a subjective selection of ‘seed accounts,’” meaning the Twitter handles of 18 people he knew to be involved in visualization. The 1,645 people included in the picture are all following or followed by at least five of these accounts.
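The selection rule described above (keep anyone who follows, or is followed by, at least five seed accounts) can be sketched in a few lines. This is not Stefaner's code; the function name, the shape of the `follows` dictionary, and the toy accounts are hypothetical, and the tiny example uses a threshold of two rather than five so it fits on the page.

```python
from collections import defaultdict

def select_linked_accounts(seeds, follows, min_links=5):
    """Return accounts linked (in either direction) to at least min_links seeds.

    follows maps each account to the set of accounts it follows.
    """
    seeds = set(seeds)
    linked_seeds = defaultdict(set)  # account -> seed accounts it is linked to
    for follower, followed_accounts in follows.items():
        for followed in followed_accounts:
            if followed == follower:
                continue  # ignore self-links
            if follower in seeds:
                linked_seeds[followed].add(follower)  # followed by a seed
            if followed in seeds:
                linked_seeds[follower].add(followed)  # follows a seed
    return {acct for acct, linked in linked_seeds.items() if len(linked) >= min_links}

# Toy network with two hypothetical seed accounts.
toy_follows = {
    "seed1": {"alice", "bob"},
    "seed2": {"alice"},
    "alice": set(),
    "bob": {"seed2"},
    "carol": {"seed1"},
}
selected = select_linked_accounts(["seed1", "seed2"], toy_follows, min_links=2)
print(sorted(selected))
```

Changing the seed list or the `min_links` threshold changes who ends up in the picture, which is exactly the kind of choice the reader never sees.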

The result is a very interesting representation of some people involved in visualization but certainly not everyone involved in visualization. Why these 18 accounts? Why not include people with four links instead of five? Part of the problem is that there is no universally accepted definition of who is “in” the visualization community, but even if there were, it’s doubtful Twitter network analysis would be the way to find them all. This chart almost completely excludes the scientific visualization community, hundreds of people who have been doing visualization for decades.

Stefaner knows there are issues of this sort, and says so in the description of this image. There’s nothing wrong with all this. But if it were to be presented as journalism, would readers need to parse the fine print to get an accurate understanding?

Communicating Uncertainty

Uncertainty is a recurring theme in data work. It’s familiar in a way, because we have all been unsure. But I don’t think most people have a natural feel for quantitative measures of uncertainty. I suspect the best way to get a feel for uncertainty is to play with simulations of probabilistic things, but your readers won’t have done that so we have to find other ways of communicating.

We’ve encountered quantified uncertainty many times already. The simplest way of presenting uncertainty is to give a range: 312 ± 7 miles. The margin of error of a sample is a more sophisticated measure that includes how often we expect the error to fall in that range: the poll numbers were 68 percent in favor, accurate to within 3 percentage points 19 times out of 20. Probabilities are also a kind of uncertainty: we analyzed the stoplight data and found that the odds were 2 to 1 in favor of the model with a working stoplight.
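For readers who want to see where a figure like "within 3 percentage points 19 times out of 20" can come from, here is a minimal sketch using the standard normal-approximation formula for a simple random sample. The sample size of 1,000 is an assumption made for illustration; the poll in the text does not state one, and real polls often apply further adjustments.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    z = 1.96 corresponds to a 95 percent level, i.e. "19 times out of 20".
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# 68 percent in favor, with an assumed sample of 1,000 respondents.
moe = margin_of_error(0.68, 1_000)
print(f"68% in favor, plus or minus {moe * 100:.1f} percentage points")
```

With these assumed inputs the margin comes out just under 3 percentage points. Because the sample size sits under a square root, quadrupling it only halves the margin.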

These sorts of numbers can be difficult to grasp on an intuitive level, yet the uncertainty in a result is a key part of that result. When the data is uncertain or leads to uncertain conclusions, it would be a lie to omit that uncertainty, or communicate it poorly.

There are many ways to communicate uncertainty. We can show it in a visualization by indicating the range of possible values.

Expected margin of victory in 2014 elections, from fivethirtyeight.com.[70]

This image from the 2014 elections shows how the margin of error on the margin of victory changed over time.[xxvi] It clarifies something which is not otherwise obvious: The polls showed a consistent lead for months, yet it was only late in the race that victory was particularly certain. All through September the odds were closer to 60/40, with the uncertainty only narrowing substantially in the second half of October.

The gray region is the range of values where the outcome is expected to fall 90 percent of the time, the 90-percent confidence interval. The easiest way to compute this range is to simulate lots and lots of elections using a model that generates random outcomes according to the known uncertainty of the polling data, then find the 5th and 95th percentiles to cut off the outliers on the bottom and top. The 90 percent figure is arbitrary, really just convention, but it provides a reasonable balance. If we showed the entire 100 percent range of the data, the gray region would stretch to include every fluke scenario. If we showed only the central 50 percent then readers might come away with an overly narrow impression of the uncertainty, because the true result would fall outside the gray area half the time (assuming a properly calibrated prediction model).
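The simulate-then-take-percentiles recipe can be sketched as follows. This is not the FiveThirtyEight model: the polling lead, the size of the polling error, and the use of a normal distribution are all assumptions chosen only to show the mechanics.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

poll_lead = 4.0        # assumed lead in percentage points
poll_error_sd = 5.0    # assumed standard deviation of the polling error

# Simulate many elections, each with a randomly perturbed margin of victory.
simulated_margins = sorted(rng.gauss(poll_lead, poll_error_sd) for _ in range(10_000))

def percentile(sorted_values, pct):
    """Simple nearest-rank percentile of an already sorted list."""
    index = round(pct / 100 * (len(sorted_values) - 1))
    return sorted_values[index]

low = percentile(simulated_margins, 5)
high = percentile(simulated_margins, 95)
win_share = sum(margin > 0 for margin in simulated_margins) / len(simulated_margins)

print(f"90 percent of simulated margins fall between {low:.1f} and {high:.1f} points")
print(f"the leader wins in {win_share:.0%} of simulations")
```

Note that with these assumed numbers the 90 percent range includes a loss for the leading candidate, even though the lead itself looks comfortable.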

We can also show uncertainty by presenting the results of simulations with randomness built in. The New York Times built a roulette machine to explain the uncertainties in its 2014 election predictions. Each state is represented by a wheel divided into colored segments according to the then-current probabilities that each party would win there. When the user clicks the spin button, all wheels spin and stop at random positions, producing a final tally of Senate seats.

An illustration of the uncertainties in the outcome of the 2014 Senate races. Each time the user presses “spin again” the wheels rotate and stop at a random position. From The New York Times.[71]

This visualization relies on the same logic we used to analyze the stoplight data in the last chapter: it uses many simulation runs to show how the effects of chance shape the data we see. Understanding how some underlying reality leads to the observed data helps you figure out what the reality is when you are trying to interpret the data.
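A spin of the wheels is easy to imitate in code. The sketch below is not the Times' implementation; the five races and their win probabilities are invented, and it treats the races as independent, which real forecasting models generally do not.

```python
import random
from collections import Counter

# Hypothetical probabilities that one party wins each of five made-up races.
win_probability = {
    "State A": 0.90,
    "State B": 0.60,
    "State C": 0.50,
    "State D": 0.35,
    "State E": 0.10,
}

def spin(rng):
    """Spin every wheel once and return the number of seats the party wins."""
    return sum(rng.random() < p for p in win_probability.values())

rng = random.Random(1)  # fixed seed so the sketch is repeatable
runs = 10_000
tallies = Counter(spin(rng) for _ in range(runs))

for seats in sorted(tallies):
    print(f"{seats} seats: {tallies[seats] / runs:.1%} of spins")

mean_seats = sum(seats * count for seats, count in tallies.items()) / runs
print(f"average seats won: {mean_seats:.2f}")
```

A single spin gives one possible election night; the tally over many spins shows how often each outcome turns up.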

These examples both involve numbers with some probabilistic error. Sometimes what we need to communicate is just a probability by itself.

Humans have a nonlinear perception of numerical probabilities, as they do with many other perceptions (such as brightness, which is perceived on a logarithmic scale). Daniel Kahneman and Amos Tversky pioneered the measurement of probability perception in the late 1970s with an experiment that gave people a choice between two bets with given odds and payoffs. They showed that people deviate in predictable ways from the best strategy of valuing a bet according to its average winnings, which you get by multiplying the probability of winning by the payoff. In these experiments, people acted as if small odds were much higher and large odds were much lower.[72] That is, people bet too much when the odds of winning were low, and too little when the odds of winning were high, even when they knew the exact odds!
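The "average winnings" benchmark is just a multiplication, as this small sketch shows. The two bets are invented for illustration and are not taken from the original experiments.

```python
def average_winnings(probability_of_winning, payoff):
    """Expected value of a bet: probability of winning times the payoff."""
    return probability_of_winning * payoff

# Two hypothetical bets.
long_shot = average_winnings(0.01, 500)  # 1 percent chance of winning 500
safe_bet = average_winnings(0.90, 10)    # 90 percent chance of winning 10

print(f"long shot: {long_shot:.2f} on average")
print(f"safe bet:  {safe_bet:.2f} on average")
```

On these made-up numbers the safe bet is worth nearly twice as much on average, yet the pattern described above suggests many people would value the long shot more highly than its average winnings justify.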

If this is how humans deal with probability figures generally, then we should expect people to exaggerate the probability of very rare events (like plane crashes) while underappreciating the probability of very likely events (like heart disease).

This is especially a problem when communicating small probability figures, such as rare risks. The probability of being struck by lightning in your lifetime is something around 0.0001.[xxvii] It’s not immediately obvious what this means, but the chart above suggests that readers will tend to perceive getting struck by lightning as very much more likely than it actually is.

All sorts of things affect the perception of the probability of some event. If the event is very bad, we may perceive it as more common.[73] We will also imagine it to be more common if it’s easy to bring examples to mind, a cognitive effect known as the availability heuristic. Thus, dying in a terrorist attack can seem just as probable as being struck by lightning even though a conservative estimate puts lightning at least ten times more likely. Telling people the actual numbers does not change this perception, because their perception is not based on numbers!

One way to communicate a probability is to talk about its frequency interpretation, that is, as a count of some number of things out of some larger number. When we say that the lifetime probability of getting hit by lightning is 0.0001, we mean that about 1 in every 10,000 people will be struck. This is a much more intuitive way of thinking about probabilities for most people. It may be more likely to lead to correct reasoning when diagnosing a disease or making other sorts of inferences from uncertain evidence.[74] Frequencies work particularly well if you can compare the denominator to familiar units of population. Let’s say there are 10,000 people in a small town; in a city of a million people, about 100 will be struck by lightning. 10,000 is likely much more than the number of people you will know in your lifetime, meaning that you probably won’t know anyone who has been or will be struck by lightning.
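Restating a probability as a frequency is simple arithmetic, sketched below with the lightning figure from the text. The helper names are hypothetical, and the results are expected counts, not guarantees about any particular town or city.

```python
def one_in(probability):
    """Restate a probability as '1 in N' (rounded to the nearest whole N)."""
    return round(1 / probability)

def expected_count(probability, population):
    """Expected number of people affected in a population of the given size."""
    return round(probability * population)

p_lightning = 0.0001  # lifetime probability used in the text

print(f"about 1 in {one_in(p_lightning):,} people")
print(f"about {expected_count(p_lightning, 1_000_000)} people in a city of a million")
```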

Comparisons are another useful way to communicate probability. The probability of getting hit by lightning is 0.0001, but the probability of dying in a car crash is 0.002, which is 20 times more likely. Again, thinking in terms of people helps: Out of 10,000 people, one will get hit by lightning, but 20 will die in a car crash. Get your measurements in units of people whenever possible; it’s a unit that everyone understands. This works particularly well as a visualization with little people icons:

Hit by lightning ☺

Dies in crash ☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺☺
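Icon rows like the ones above can be generated directly from the probabilities. This is a minimal text-only sketch; the function name and the choice of 10,000 as the reference population are assumptions for illustration.

```python
def icon_row(label, probability, out_of=10_000, icon="☺"):
    """One icon per expected person affected, out of a reference population."""
    count = round(probability * out_of)
    return f"{label:<18}{icon * count}"

print(icon_row("Hit by lightning", 0.0001))
print(icon_row("Dies in crash", 0.002))
```

Using the same reference population for every row is what lets the reader compare the rows at a glance.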

The ratio of the odds of something happening in one case as opposed to another is called the odds ratio, and it’s a standard figure used to compare two groups. Here the odds ratio of car crash versus lightning is (20/9980)/(1/9999) ≈ 20. Often two groups are thought to have different risks or chances of something, like the probability of heart disease for those who do and do not exercise, or the probability of getting into college for those who went to different high schools.
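The odds ratio in the example can be reproduced step by step. The sketch below uses the counts from the text (20 and 1 out of 10,000); the function names are hypothetical.

```python
def odds(events, total):
    """Odds of an event: how many times it happens per time it does not."""
    return events / (total - events)

def odds_ratio(events_a, events_b, total):
    """Odds of outcome A divided by the odds of outcome B, same group size."""
    return odds(events_a, total) / odds(events_b, total)

crash_vs_lightning = odds_ratio(20, 1, 10_000)  # (20/9980) / (1/9999)
probability_ratio = (20 / 10_000) / (1 / 10_000)

print(f"odds ratio:        {crash_vs_lightning:.2f}")
print(f"probability ratio: {probability_ratio:.2f}")
```

For rare events like these, odds and probabilities are nearly the same number, which is why the odds ratio and the ratio of probabilities both come out at about 20. For common events the two can differ noticeably.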

An odds ratio clearly communicates the relation between two odds, but it obscures the overall magnitude of each. Sure, banning a toxic chemical can reduce the odds of a certain type of cancer by a factor of 2, but if only two people are expected to get that cancer then it’s not a very significant public health intervention. By contrast, a tiny improvement in the odds of getting lung cancer might save thousands of lives.
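A quick calculation makes the gap between relative and absolute effects concrete. The case counts and reductions below are invented for illustration; only the "two people" figure echoes the text.

```python
def cases_avoided(baseline_cases, relative_reduction):
    """Absolute number of cases avoided for a given relative reduction in risk."""
    return baseline_cases * relative_reduction

# A rare cancer: 2 expected cases, risk cut in half.
rare_cancer = cases_avoided(2, 0.5)

# A common cancer: an assumed 150,000 expected cases, risk cut by just 1 percent.
common_cancer = cases_avoided(150_000, 0.01)

print(f"rare cancer, risk halved:       about {rare_cancer:.0f} case avoided")
print(f"common cancer, risk down by 1%: about {common_cancer:.0f} cases avoided")
```

Reporting both numbers, the relative change and the absolute count, keeps an impressive-sounding ratio from standing in for a small real-world effect.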

It is possible to communicate both absolute and relative odds at the same time. Here’s smoking versus mortality again, this time by age:

Smokers versus non-smokers survival curves, from stubbornmule.net.[75]

Everything you need to know is there, but it’s a little hard to interpret. Let’s see… 60 percent of non-smokers will live to 80 versus 25 percent of smokers. Figuring out what this data means requires far too much messing around with the chart and thinking through figures. Compare to this visualization:

Smokers versus non-smokers survival curves, from stubbornmule.net.[76]

This visualization uses all the principles we’ve discussed. It represents probabilities as people, and compares probabilities both between smokers and non-smokers and between different ages. No one can know whether they will die from smoking, but visualizations like this can make the uncertainties personal.

There are lots of quantitative communication tricks and techniques you can pick up, and the visualizations here are not the last word in design. But the most important principle of communicating uncertainty is this: Communicate it. Don’t let someone come away from your story with a warped sense of the risk, or too certain about something subtle. This is just basic respect for the reader and for the difficulties of knowing.

Prediction

Prediction is important because action is important. What use is journalism that doesn’t help you decide what to do? This requires knowledge of futures and consequences. Prediction also has close links to truth. Falsification is one of the strongest truth-finding methods, and it’s prediction that allows us to compare our ideas with the world to see if they hold up. Prediction is at the core of hypothesis testing, and therefore at the core of science.

Journalists think about the future constantly, and sometimes publish their predictions: A particular candidate will win the election; the president will veto the bill if it’s not revised; this war will last at least five years. It may be even more common to let sources make predictions: The analyst says that housing prices will continue to increase; a new study says this many people will be forced to move as the seas rise. Leaning on experts doesn’t excuse the journalist who disseminates bad predictions unchallenged, and it turns out that experts quite often make bad predictions.

The landmark work here is Philip Tetlock’s Expert Political Judgment.[77] Starting in 1984, Tetlock and his colleagues solicited 82,361 predictions from 285 people whose profession included “commenting or offering advice on political and economic trends.” He asked very concrete questions that could be scored yes or no, questions like: “Will Gorbachev be ousted in a coup?” or “Will Quebec secede from Canada?”

The experts’ accuracy, over 20 years of predictions and across many different topics, was consistently no better than guessing. As Tetlock put it, a “dart-throwing chimp” would do just as well. Our political, financial, and economic experts are, almost always, just making it up when it comes to the future.

I suspect this is disappointing to a lot of people. Perhaps you find yourself immediately looking for explanations or rationalizations. Maybe Tetlock didn’t ask the true experts, or the questions were too hard. Unfortunately the methodology seems solid, and there’s certainly a lot of data to support it. The conclusion seems inescapable: We are all terrible at predicting our social and political future, and no amount of education or experience helps.

What does help is keeping track of your predictions. This is perhaps Tetlock’s greatest contribution.

Although there is nothing odd about experts playing prominent roles in debates, it is odd to keep score, to track expert performance against explicit benchmarks of accuracy and rigor.[78]

The simplest way to do this is just to write down each prediction you make and, when the time comes, tally it as right or wrong. At the very least this will force you to be clear. Like a bet, the terms must be unambiguous from the outset.

A more sophisticated analysis takes into account both what you predict and how certain you think the outcome is. Out of all the predictions that you said were 70 percent certain, about 70 percent should come to pass. If you track both your predictions and your confidence, you can eventually produce a chart comparing your confidence to the reality. As Tetlock put it, “Observers are perfectly calibrated when there is precise correspondence between subjective and objective probabilities.”

From Tetlock.[79]

Subjective probability is how confident someone said they were in their prediction, while the objective frequency is how often the predictions at that confidence level actually came true. In this data, when the experts gave something a 60 percent chance of occurring, their predictions came to pass 40 percent of the time. Overall, this chart shows the same general pattern found in other studies of probability perception: Rare events are perceived as much too likely, while common events are thought to be unduly rare. It also shows that expert knowledge helps, but only to a point. “Dilettantes” with only a casual interest in the topic did just as well as experts, and students who were given only three paragraphs of information were only slightly worse.
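Keeping score in this way needs nothing more than a list of predictions and a tally. The sketch below groups a made-up prediction log by stated confidence and reports how often each group came true; the log entries and the function name are hypothetical, and a real log would need far more entries per confidence level before the comparison means much.

```python
from collections import defaultdict

def calibration_table(prediction_log):
    """Map each stated confidence to the share of those predictions that came true.

    prediction_log is a list of (stated_confidence, came_true) pairs.
    """
    outcomes_by_confidence = defaultdict(list)
    for stated_confidence, came_true in prediction_log:
        outcomes_by_confidence[stated_confidence].append(came_true)
    return {
        confidence: sum(outcomes) / len(outcomes)
        for confidence, outcomes in sorted(outcomes_by_confidence.items())
    }

# Made-up log: five predictions at 70 percent confidence, four at 90 percent.
log = [
    (0.7, True), (0.7, True), (0.7, False), (0.7, True), (0.7, False),
    (0.9, True), (0.9, True), (0.9, False), (0.9, True),
]

table = calibration_table(log)
for confidence, observed in table.items():
    print(f"said {confidence:.0%} sure, right {observed:.0%} of the time")
```

A well-calibrated forecaster would see the two columns roughly match; in this invented log the forecaster is overconfident at both levels.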

The overall lesson here is not that people are stupid, but that predicting the future is very hard and we tend to be overconfident. Another key line of research shows that statistical models are one of the best ways to improve our predictions.

In 1954 a clinical psychologist named Paul Meehl published a slim book titled Clinical Versus Statistical Prediction.[80] His topic was the prediction of human behavior: questions such as “what grades will this student get?” or “will this employee quit?” or “how long will this patient be in the hospital?” These sorts of questions have great practical significance; it is on the basis of such predictions that criminals are released on parole and scholarships are awarded to promising students.

Meehl pointed out that there were only two ways of combining information to make a prediction: human judgment or statistical models. Of course, it takes judgment to build a statistical model, and you can also turn human judgment into a number by asking questions such as “on a scale of 1–5, how seriously does this person take their homework?” But there must be some final method by which all available information is synthesized into a prediction, and that will either be done by a human or a mechanical process.

It turns out that simple statistical methods are almost always better than humans at combining information to predict behavior.

Sixty years ago, Meehl examined 19 studies comparing clinical and statistical prediction, and only one favored the trained psychologist over simple actuarial calculations.[81] This is even more impressive when you consider that the humans had access to all sorts of information not available to the statistical models, including in-depth interviews. Since then the evidence has only mounted in favor of statistics. More recently, a review of 136 studies comparing the two methods showed that statistical prediction was as good as or better than clinical prediction about 90 percent of the time, and quite a lot better about 40 percent of the time. This holds across many different types of predictions including medicine, business, and criminal justice.[82]

This doesn’t mean that statistical models do particularly well, just better than humans. Some things are very hard to predict, maybe most things, and simply guessing based on the overall odds can be as good (or as bad) as a thorough analysis of the current case. But to do this you have to know the odds, and humans aren’t particularly good at intuitively collecting and using frequency information.

In fact the statistical models in question are usually simple formulas, nothing more than multiplying each input variable by some weight indicating its importance, then adding the weighted variables together. In one study, college grades were predicted by just such a weighted sum of the student’s high school grade percentile and their SAT score. The weights were computed by regression from the last few years of data, which makes this a straightforward extrapolation from the past to the future. Yet this formula did as well as professional evaluators who had access to all the admission materials and conducted personal interviews with each student. The two prediction methods failed in different ways, and those differences could matter, but they had similarly mediocre average performance.
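To make "a weighted sum with weights computed by regression" concrete, here is a minimal sketch in plain Python. It is not the model from the study: the five past students are invented, their grades were constructed to fit a formula exactly so the result is easy to check, and real data would be noisy. The function names are hypothetical, and the fitting method is ordinary least squares solved through the normal equations.

```python
def fit_weights(rows, targets):
    """Ordinary least squares: solve (X^T X) w = X^T y by Gauss-Jordan elimination."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def predict(weights, row):
    """The whole model: multiply each input by its weight and add them up."""
    return sum(w * x for w, x in zip(weights, row))

# Invented past students: [constant 1, high school percentile, SAT score].
past_students = [
    [1.0, 90, 1400], [1.0, 50, 1000], [1.0, 80, 1000],
    [1.0, 60, 1300], [1.0, 40, 1200],
]
past_college_gpa = [3.7, 2.5, 3.1, 3.0, 2.5]

weights = fit_weights(past_students, past_college_gpa)
predicted_gpa = predict(weights, [1.0, 70, 1100])  # a new applicant
print(f"predicted college GPA: {predicted_gpa:.2f}")
```

Everything the model "knows" lives in three numbers: a baseline and one weight per input. That simplicity is what makes it easy to audit, and also why it cannot react to anything that is not one of its inputs.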

The idea that simplistic mechanical predictors match or beat expert human judgment has offended many people, and it’s still not taken as seriously as perhaps it should be. But why should this be offensive? Meehl explained the result this way:

Surely we all know that the human brain is poor at weighting and computing. When you check out at a supermarket, you don’t eyeball the heap of purchases and say to the clerk, “Well it looks to me as if it’s about $17.00 worth; what do you think?” The clerk adds it up.[83]

Of course the statistical models used for prediction don’t choose themselves. Someone has to imagine what factors might be relevant, and there is a great deal of expertise and work that goes into designing and calibrating a statistical model. Also, a model can always be surprised. An election prediction model will break down in the face of fraud, and an academic achievement model can’t know what a death in the student’s family will mean. Moreover, there can always be new insights into the workings of things that lead to better models. But generally, a validated model is more accurate than human guesses, even when the human has access to a lot of additional data.

I think there are three lessons for journalism in all of this. First, prediction is really hard, and almost everyone who does it is doing no better than chance. Second, it pays to use the best available method of combining information, and that method is often simple statistical prediction. Third, if you really do care about making correct predictions, the very best thing you can do is track your accuracy.

Yet most journalists think little about accountability for their predictions, or the predictions they repeat. How many pundits throw out statements about what Congress will or won’t do? How many financial reporters repeat analysts’ guesses without ever checking which analysts are most often right? The future is very hard to know, but standards of journalistic accuracy apply to descriptions of the future at least as much as they apply to descriptions of the present, if not more so. In the case of predictions it’s especially important to be clear about uncertainty, about the limitations of what can be known.

I believe that journalism should help people to act, and that requires taking prediction seriously.

Going Further

You are probably no closer to finishing your next data project after reading this book.

I am painfully aware that the theory in this book is somewhat removed from the daily work of data journalism. You’re going to need practical skills like working with spreadsheets, cleaning data, coding up visualizations, and asking civil servants for explanations. I’ve covered none of this craft.

Yet all of this work is guided by old and deep principles. Journalists are latecomers to quantitative thinking. That’s unfortunate, because numbers can bring us closer to the truth. But only sometimes. Hopefully you now have a better sense of the limitations of data, and the ways we analyze and communicate data.

There’s a lot more to learn.

There is an endless number of technical concepts relevant to data work. I’ve tried to give an authentic taste of the state of the art, and Bayesian statistics and cognitive biases are at the forefront of contemporary practice across many fields. Still, these presentations do not have the depth and detail needed to do real work; no one is going to learn to do statistical analysis from what I’ve written. Not exactly.

The good news is you don’t have to learn everything at once. An education in statistics will give you powerful fundamentals that can be used to reason about subtle problems, but you won’t need to do that every day. Also, that’s what collaborators and mentors are for. A journalist’s primary responsibility is to the story, and technical mastery comes from the experience of many solved problems.

It’s not knowing everything that makes a technical professional, it’s being willing to find out. I’ve used standard mathematical language in an effort to help you find more information; with a search engine, knowing the true name of something gives you the ability to summon it at will. So don’t be surprised when you don’t know something. If you’re anything like me you’ll get the code wrong the first time, even when you do know what you’re doing. But never doubt that there is a logic underlying every equation and every line of code. These things are not magic; though the symbolic languages of data can be intimidating, there is nothing occult here.

My advice is to look always for the underlying sense of the thing, the plain-language explanation. This sense can be hard to find. When you ask a question like “why does a survey have a bell-shaped error distribution?” you will soon find yourself lost in inscrutable proofs, answers that seem to presuppose you already know, explanations that don’t really explain. This is an unfortunate comment on the state of our educational materials, but don’t lose hope! Keep searching until you find an answer that makes sense.

Yet a technician is not a journalist. What will you be able to do with all of this understanding and ability?

As with any medium, it can take a while to find your voice in data journalism. Sure, you can do analysis and visualization and all the rest of it, but what are you saying? What questions are you asking? What is it that is so important, so urgent, that you must command a stranger’s time to tell it to them?

I don’t know of any way to discover what you want to say other than saying it. Just write. And report and code and visualize, but whatever else you do, put your work into the world. Then do the next one. As Steve Jobs said, real artists ship.

If you continue your study of the deep workings of data, you will discover entire worlds. You will retrace thousands of years of inspired ideas, re-experiencing each little epiphany as your own. You will gradually arrive at one of the most exciting frontiers of human thought, and you will join professionals in many other fields who are transforming their work through data. Quantitative ideas now pervade every aspect of the functioning of society, from health to finance to politics. It’s impossible to understand the modern world without understanding data.

And if you do understand data, you will begin to see stories that others literally cannot imagine. We need those stories told. That is, perhaps, the best possible argument for learning more.

Footnotes

i You might as well expand that to the relationship between story and science. It’s a vexing question. See, for example, Gelman and Basbøll.84

ii The classic discussion of the human creation of categories is Sorting Things Out: Classification and Its Consequences.85

iii For a thorough discussion of race on the census, see Snipp.86

iv For a fantastic list of 20 reasons why quantification is difficult in psychology, see Meehl.87

v For a really excellent exposition of the problems of counting “mass shooting,” see Watt.88

vi Nehemiah 11:1.

vii For more on these two unemployment surveys and the difference between them, see U.S. Bureau of Labor Statistics.89

viii Actually 60,000 randomly chosen households, which is about 150,000 people. See U.S. Census Bureau.90

ix Similar, but not identical, because Bernoulli initially considered sampling “with replacement,” where each person might be chosen more than once. This is probably because sampling with replacement is mathematically simpler, and Bernoulli worked with approximate formulas that become more accurate as the number of samples increases, rather than the very large numbers involved in calculating the number of possibilities directly, which require computers.
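
To make the difference between the two kinds of sampling concrete, here is a minimal Python sketch. The population of 1,000 people with 40 percent answering “yes,” the sample size of 500, and the function name spread_of_estimates are all made up for illustration; none of them come from Bernoulli or from the surveys discussed in the text. The code simply draws many samples each way and measures how much the estimated percentage wobbles.

```python
import random
import statistics

def spread_of_estimates(population, sample_size, replace, trials=2000, seed=3):
    """Standard deviation of the sample proportion across many simulated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        if replace:
            # Each draw is independent, so one person can be picked repeatedly.
            sample = rng.choices(population, k=sample_size)
        else:
            # Each person can appear at most once in the sample.
            sample = rng.sample(population, sample_size)
        estimates.append(sum(sample) / sample_size)
    return statistics.pstdev(estimates)

# Hypothetical population: 1 means "yes", 0 means "no".
population = [1] * 400 + [0] * 600

if __name__ == "__main__":
    print("with replacement:   ", spread_of_estimates(population, 500, replace=True))
    print("without replacement:", spread_of_estimates(population, 500, replace=False))
```

When the sample is half the population, as here, sampling without replacement gives noticeably steadier estimates. When the sample is a tiny fraction of the population the two methods give nearly the same spread, which is why the simpler with-replacement math is usually good enough.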

x I’m indebted to Mark Hansen for the phrasing of these two key sentences.

xi Before I get hate mail: Yes, it is wrong to say that there is a 90 percent chance that the true value falls within a 90-percent confidence interval. The contortions of frequentist statistics require us to say instead that our method of constructing the confidence interval will include the true value for 90 percent of the possible samples, but we don’t know anything at all about this particular sample. The distinction is subtle but real. It’s also usually irrelevant for this type of sampling margin of error computation, where the confidence interval is numerically very close to the Bayesian credible interval, which actually does contain the true value with 90 percent probability. See e.g. Vanderplas.91
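
The frequentist statement can be checked by brute force. The sketch below is only an illustration under assumed numbers: a made-up true value of 40 percent, surveys of 1,000 people, and the usual normal-approximation formula for the interval (the 1.645 multiplier corresponds to 90 percent). The function names are hypothetical. It runs many simulated surveys and counts how often the interval built from each one contains the true value.

```python
import math
import random

Z_90 = 1.645  # normal quantile for a two-sided 90 percent interval

def interval_for_sample(successes, n):
    """Approximate 90 percent confidence interval for a proportion."""
    p_hat = successes / n
    margin = Z_90 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def coverage(true_p=0.4, n=1000, surveys=2000, seed=7):
    """Fraction of simulated surveys whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(surveys):
        # One simulated survey: ask n people, each says "yes" with chance true_p.
        successes = sum(1 for _ in range(n) if rng.random() < true_p)
        low, high = interval_for_sample(successes, n)
        if low <= true_p <= high:
            hits += 1
    return hits / surveys

if __name__ == "__main__":
    print("fraction of intervals containing the true value:", coverage())
```

The fraction comes out close to 0.9. The guarantee describes the procedure over many possible samples, not any single interval.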

xii Whether or not anything is “truly” random is a metaphysical question. Perhaps the universe is fully deterministic and everything is fated in advance. Or perhaps more data or better knowledge would reveal subtle connections. But from a practical point of view, we only care if these fluctuations are random to us. Randomness, chance, noise: There is always something in the data which follows no discernable pattern, caused by factors we cannot explain. This doesn’t mean that these factors are unexplainable. There may be trends or patterns we aren’t seeing, or additional data that might be used to explain what looks like chance. For example, we might one day discover that the number of assaults is driven by the weather. But until we discover this relationship, we have no ability to predict or explain the variations in the assault rate, so we have little choice but to treat them as random.

xiii For a fantastic history of these ideas, see Ian Hacking’s The Emergence of Probability.92

xiv Although the mathematics turn out the same, there’s a useful distinction between something which we must treat as random because we don’t know the correct answer (epistemic uncertainty) and something which has intrinsic randomness in its future course (aleatory uncertainty). The difference is important in risk management, where our uncertainty might be reduced if we did more research, or we might be up against fundamental limits of prediction.

xv Peirce’s simple argument assumes complete statistical independence between the positions of every stroke in a signature. That’s dubious, because if you move one letter while signing, the rest of the letters will probably have to move too. A more careful analysis93 shows that an exact signature match is much more likely than one in 5^30 but still phenomenally unlikely to happen by chance.

xvi For a baggage-free introduction to applied Bayesian stats I recommend McElreath’s Statistical Rethinking, or his marvelous lecture videos.94

xvii I’m referencing the butterfly effect, the idea that the disturbances from a butterfly flapping its wings might eventually become a massive hurricane. More generally, this is the idea that small perturbations are routinely magnified into huge changes. The early chaos theorist Edward Lorenz came up with the butterfly analogy while studying weather prediction in the early 1960s. In practice, this uncertainty amplification effect means there will be random variations in our data, due to specific unrepeatable circumstances, that we cannot ever hope to understand.

xviii This type of independent events model is also called a Poisson distribution, after the French mathematician Siméon Denis Poisson, who first worked through the math in the 1830s. But the nice thing about using a simulation of our intersection is that it’s not necessary to know the mathematical formula for the Poisson distribution. Simply flipping independent coins gives the same result. Simulation is a revolutionary way to do statistics because it so often turns difficult mathematics into easy code.
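
Here is one way that comparison might look in Python. The numbers are invented for illustration (1,000 independent chances per year, each with a 0.5 percent probability of an accident), and they are not the figures used for the intersection in the text; the function names are hypothetical too. The Poisson formula appears only so the simulated frequencies have something to be checked against.

```python
import math
import random

def simulate_year(rng, chances=1000, p=0.005):
    """Count accidents in one year by flipping one biased coin per chance."""
    return sum(1 for _ in range(chances) if rng.random() < p)

def simulated_distribution(years=5000, chances=1000, p=0.005, seed=1):
    """Fraction of simulated years that end with each accident count."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(years):
        k = simulate_year(rng, chances, p)
        counts[k] = counts.get(k, 0) + 1
    return {k: n / years for k, n in sorted(counts.items())}

def poisson_pmf(k, mean):
    """Poisson formula, included only for comparison with the simulation."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

if __name__ == "__main__":
    sim = simulated_distribution()
    mean = 1000 * 0.005  # expected accidents per year under these assumptions
    print("accidents  simulated  formula")
    for k in range(13):
        print(k, round(sim.get(k, 0.0), 3), round(poisson_pmf(k, mean), 3))
```

The two columns should agree to within simulation noise, even though the simulation never uses the formula.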

xix Maybe both of your hypotheses are wrong, and something else entirely happened. Maybe your models, which are pieces of code, aren’t good representations of your hypotheses, which are ideas expressed in language. Maybe your data is the result of both a working stoplight and some amount of luck. Maybe the intersection was rebuilt after the second year with wider lanes and a new stoplight, and it’s really the wider lanes that caused the change. Maybe the bureaucracy that collects this data changed the definition of “accident” to exclude smaller collisions. Or maybe you added up the numbers wrong.

xx Unemployment versus investment chart from Mankiw.95

xxi But sometimes it is possible to tell which of two variables is the cause and which is the effect just from the data, by exploiting the fact that noise in the cause shows up in the effect but not vice versa. See Mooij et al.96

xxii Michael Keller, private communication.

xxiii I found this circulating on the Internet, and was unable to figure out who made it. Much love to the unknown creator.

xxiv It probably wasn’t Bertrand Russell who first said, “The mark of a civilized human is the ability to look at a column of numbers, and weep.” But per http://quoteinvestigator.com/2013/02/20/moved-by-stats/ there is a history of quoting and misquoting a similar phrase. The original text is Russell’s The Aims of Education:

> The next stage in the development of a desirable form of sensitiveness is sympathy. There is a purely physical sympathy: A very young child will cry because a brother or sister is crying. This, I suppose, affords the basis for the further developments. The two enlargements that are needed are: first, to feel sympathy even when the sufferer is not an object of special affection; secondly, to feel it when the suffering is merely known to be occurring, not sensibly present. The second of these enlargements depends mainly upon intelligence. It may only go so far as sympathy with suffering which is portrayed vividly and touchingly, as in a good novel; it may, on the other hand, go so far as to enable a man to be moved emotionally by statistics. This capacity for abstract sympathy is as rare as it is important.

Many others attribute the pithier quote to Russell, but the original source for that is nowhere to be found. I really like the shorter quote no matter where it ultimately came from; it’s a fine string of words.

xxv I’ll use reader as a generic name for the consumer of a story, with apologies to reporters working in other formats.

xxvi Totally fun to say.

xxvii Lifetime odds of being struck by lightning estimated at 1 in 12,000 by NOAA, based on 2004–2013 averages.97

Citations

1. Denise Schmandt-Besserat, “Tokens and Writing: The Cognitive Development,” SCRIPTA (2009): 145–154, http://sites.utexas.edu/dsb/files/2014/01/TokensWriting_the_Cognitive_Development.pdf.

2. “Table A-15: Alternative Measures of Labor Underutilization,” U.S. Bureau of Labor Statistics, http://www.bls.gov/news.release/empsit.t15.htm.

3. Jonathan Stray, “Ethics in Data Journalism: Margin of Error in Bureau of Labor Statistics Reports,” Data Driven Journalism, 15 January 2016, http://datadrivenjournalism.net/news_and_analysis/ethics_in_data_journalism_margin_of_error_in_bureau_of_labor_statistics_rep.

4. George Cobb, “The Introductory Statistics Course: a Ptolemaic Curriculum,” Technology Innovations in Statistics Education, 1 (2007), http://escholarship.org/uc/item/6hb3k0nz.

5. James C. Scott, Seeing Like a State (New Haven: Yale University Press, 1998).

6. David Hestenes, “Oersted Medal Lecture 2002: Reforming the Mathematical Language of Physics,” American Journal of Physics, 104 (2003), http://dx.doi.org/10.1119/1.1522700.

7. Brian Gratton and Myron P. Gutmann, “Hispanics in the United States 1850–1990,” Historical Methods, 3 (2000), http://www.latinamericanstudies.org/immigration/Hispanics-US-1850-1990.pdf.

8. David Niose, “Anti-Intellectualism Is Killing America,” Psychology Today, 23 June 2015, https://www.psychologytoday.com/blog/our-humanity-naturally/201506/anti-intellectualism-is-killing-america.

9. G. Kitson Clark, The Making of Victorian England (New York: Routledge, 1962),

10.

11. Chris Davis and Matthew Doig, “State Scraps Felon Voter List,” Sarasota Herald-Tribune, 12 July 2004, http://www.heraldtribune.com/article/20040712/NEWS/

12.

13. Matt Waite, “Handling Data About Race and Ethnicity,” OpenNews Source, 20 June 2014, https://source.opennews.org/en-US/learning/handling-data-about-race-and-ethnicity/.

14. “Sixteenth Decennial Census of the United States, Instructions to Enumerators, Population and Agriculture,” U.S. Census Bureau, 1940, http://www.census.gov/history/pdf/1940instructions.pdf.

15. Jens Manuel Krogstad and Mark Hugo Lopez, “‘Mexican,’ ‘Hispanic,’ ‘Latin American’ Top List of Race Write-ins on the 2010 Census,” Pew Research Center, 4 April 2014, http://www.pewresearch.org/fact-tank/2014/04/04/mexican-hispanic-and-latin-american-top-list-of-race-write-ins-on-the-2010-census/.

16. “Directive No. 15 as Adopted on May 12, 1977,” U.S. Census Bureau, 1977, http://wonder.cdc.gov/wonder/help/populations/bridged-race/directive15.html.

17. Jerzy Wojewoda et al., “Hysteretic Effects of Dry Friction: Modelling and Experimental Studies,” Philosophical Transactions of the Royal Society A, 1866 (2008), http://rsta.royalsocietypublishing.org/content/366/1866/747.

18. “Employment Situation Technical Note,” U.S. Bureau of Labor Statistics, 2015, http://www.bls.gov/news.release/empsit.tn.htm.

19. Neil Irwin and Kevin Quealy, “How Not to Be Misled by the Jobs Report,” The New York Times, 1 May 2014, http://www.nytimes.com/2014/05/02/upshot/how-not-to-be-misled-by-the-jobs-report.html?_r=0.

20. “How the Government Measures Unemployment,” U.S. Bureau of Labor Statistics, 2015, http://www.bls.gov/cps/cps_htgm.htm.

21. “Employment Situation Technical Note.”

22. Marianne Durand and Philippe Flajolet, “Loglog Counting of Large Cardinalities,” in ESA (2003), 605–617, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.2718.

23. Sir Arthur Conan Doyle, “The Adventure of the Blanched Soldier,” in The Case-Book of Sherlock Holmes (1927), 54.

24. Mike Bostock et al., “One Report, Diverging Perspectives,” The New York Times, 5 October 2012, http://www.nytimes.com/interactive/2012/10/05/business/economy/one-report-diverging-perspectives.html.

25. James Fallows, “Why to Get More Than 1 Newspaper, iPad Edition,” The Atlantic, 22 October 2013, http://www.theatlantic.com/national/archive/2013/10/why-to-get-more-than-1-newspaper-ipad-edition/280772/.

26. Kypros Kypri et al., “Effects of Restricting Pub Closing Times on Night-time Assaults in an Australian City,” Addiction, 2 (2011), http://onlinelibrary.wiley.com/enhanced/doi/10.1111/j.1360-0443.2010.03125.x/.

27. Ibid.

28. Nate Silver, The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t (New York: Penguin, 2012), 484.

29. Ibid.

30. Sanjoy Mahajan, Street-Fighting Mathematics: The Art of Educated Guessing and Opportunistic Problem Solving (Cambridge: MIT Press, 2010).

31. Meier and Zabell, “Benjamin Peirce and the Howland Will.”

32. Ian Hacking, “Telepathy: Origins of Randomization in Experimental Design,” Isis, 3 (1998), http://www.jstor.org/stable/234674.

33. Gerard E. Dallal, “Why P=0.05?” http://www.jerrydallal.com/LHSP/p05.htm.

34. Anders Hald, “On the History of Maximum Likelihood in Relation to Inverse Probability and Least Squares,” Statistical Science, 2 (1999), http://www.jstor.org/stable/2676741.

35. Robert Kass and Adrian Raftery, “Bayes Factors,” Journal of the American Statistical Association, 430 (1995), http://www.jstor.org/stable/2291091.

36. Sharon Bertsch McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy (New Haven: Yale University Press, 2011).

37. Steven Raphael and Jens Ludwig, Evaluating Gun Policy: Effects on Crime and Violence (Chicago: Brookings Institution Press, 2003), 251–277, http://home.uchicago.edu/ludwigj/papers/Exile_chapter_2003.pdf.

38. Ibid.

39. Steven D. Levitt, “Understanding Why Crime Fell in the 1990s: Four Factors That Explain the Decline and Six That Do Not,” The Journal of Economic Perspectives, 1 (2004).

40. Raphael and Ludwig, Evaluating Gun Policy: Effects on Crime and Violence.

41. Kypri et al., “Effects of Restricting Pub Closing Times on Night-time Assaults in an Australian City.”

42. Raphael and Ludwig, Evaluating Gun Policy: Effects on Crime and Violence.

43. Occupational Mortality: The Registrar General’s Decennial Supplement for England and Wales, 1970–1972 (London: Her Majesty’s Stationery Office, 1978), http://lib.stat.cmu.edu/DASL/Datafiles/SmokingandCancer.html.

44. Franz H. Messerli, “Chocolate Consumption, Cognitive Function, and Nobel Laureates,” New England Journal of Medicine (2012): 1562–1564.

45. Greg Mankiw, “A Striking Scatterplot,” 29 March 2011, http://gregmankiw.blogspot.com/2011/03/striking-scatterplot.html.

46. Ibid.

47. Christian Rudder, “Exactly What to Say in a First Message,” OKCupid blog, 2009, http://blog.okcupid.com/index.php/online-dating-advice-exactly-what-to-say-in-a-first-message/.

48. Milberger et al., “Tobacco Manufacturers’ Defence Against Plaintiffs’ Claims of Cancer Causation: Throwing Mud at the Wall and Hoping Some of It Will Stick,” Tobacco Control (December 2006): iv17–iv26, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2563590/.

49. Andrew Gelman, “Statistics for Cigarette Sellers,” Chance, 3 (2012), http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics4.pdf.

50. Bikaramjit Mann and Evan Wood, “Confounding in Observational Studies Explained,” The Open Epidemiology Journal (2012), http://benthamopen.com/contents/pdf/TOEPIJ/TOEPIJ-5-18.pdf.

51. James F. Pagel, Natalie Forister, and Carol Kwiatkowki, “Adolescent Sleep Disturbance and School Performance: The Confounding Variable of Socioeconomics,” Journal of Clinical Sleep Medicine, 1 (2007).

52. Judea Pearl, Causality: Models, Reasoning, and Inference, 2nd Edition (Cambridge: Cambridge University Press, 2009).

53. Daniel Kaplan, Statistical Modeling: A Fresh Approach, Second Edition (Project Mosaic, 2012).

54. John Stuart Mill, A System of Logic, Vol. 1 (: 1843), 455.

55. Matt Apuzzo and Adam Goldman, “Documents Show NY Police Watched Devout Muslims,” Associated Press, 6 September 2011, http://www.ap.org/Content/AP-In-The-News/2011/Documents-show-NY-police-watched-devout-Muslims.

56. Philip Kitcher, The Advancement of Science: Science Without Legend, Objectivity Without Illusions (Oxford: Oxford University Press, 1993).

57. Daniel Kahneman, Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2013).

58. Richards J. Heuer Jr., The Psychology of Intelligence Analysis (: CIA, 1999), https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/art11.html.

59. Charles Sanders Peirce, “Some Consequences of Four Incapacities,” Journal of Speculative Philosophy (1868): 140–157.

60. Tamara Munzner, “Visualization,” in Fundamentals of Computer Graphics, Third Edition, ed. Peter Shirley and Steve Marschner (AK Peters, 2009), 675–707, http://www.cs.ubc.ca/labs/imager/tr/2009/VisChapter/.

61. Justin McCarthy, “Most Americans Still See Crime Up Over Last Year,” Gallup, 21 November 2014, http://www.gallup.com/poll/179546/americans-crime-last-year.aspx.

62. Ibid.

63. Ruth Hamill, Timothy DeCamp Wilson, and Richard E. Nisbett, “Insensitivity to Sample Bias: Generalizing From Atypical Cases,” Journal of Personality and Social Psychology, 4 (1980).

64. Ibid.

65. Ibid.

66. Ibid.

67. Ibid.

68. Angela Fagerlin, Catharine Wang, and Peter A. Ubel, “Reducing the Influence of Anecdotal Reasoning on People’s Health Care Decisions: Is a Picture Worth a Thousand Statistics?” Medical Decision Making, 4 (2005).

69. Stray.

70. Jessica M. Pollak and Charis E. Kubrin, “Crime in the News: How Crimes, Offenders and Victims Are Portrayed in the Media,” Journal of Criminal Justice and Popular Culture, 1 (2007).

71. Miguel Ríos, “The Geography of Tweets,” Twitter, 31 May 2013, https://blog.twitter.com/2013/the-geography-of-tweets.

72. Moritz Stefaner, “The VIZoSPHERE, 2011,” 2011, http://www.visualizing.org/full-screen/29391.

73. “Special Coverage of the 2014 Midterms,” FiveThirtyEight, 4 November 2014, http://fivethirtyeight.com/live-blog/special-coverage-the-2014-midterms/?#livepress-update-20137747.

74. The New York Times, “Who Will Win the Senate?” 4 November 2014, http://www.nytimes.com/newsgraphics/2014/senate-model/.

75. Elke Weber, “From Subjective Probabilities to Decision Weights: The Effect of Asymmetric Loss Functions on the Evaluation of Uncertain Outcomes and Events,” Psychological Bulletin, 2 (1994).

76. Adam J. L. Harris and Adam Corner, “Communicating Environmental Risks: Clarifying the Severity Effect in Interpretations of Verbal Probability Expressions,” Journal of Experimental Psychology: Learning, Memory, and Cognition, 6 (2011).

77. Ulrich Hoffrage et al., “Representation Facilitates Reasoning: What Natural Frequencies Are and What They Are Not,” Cognition (2002), http://www.sciencedirect.com/science/article/pii/S0010027702000501.

78. “Visualizing Smoking Risk,” Stubborn Mule, 21 October 2010, http://www.stubbornmule.net/2010/10/visualizing-smoking-risk/.

79. Ibid.

80. Philip E. Tetlock, Expert Political Judgment: How Good Is It? How Can We Know? (Princeton: Princeton University Press, 2005).

81. Ibid.

82. Ibid.

83. Paul Meehl, Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence (Minneapolis: University of Minnesota, 1954).

84. Quinn McNemar, “Review of Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence by Paul E. Meehl,” The American Journal of Psychology, 3 (September 1955).

85. William M. Grove et al., “Clinical Versus Mechanical Prediction: A Meta-analysis,” Psychological Assessment, 1 (2000).

86. Paul Meehl, “Causes and Effects of My Disturbing Little Book,” Journal of Personality Assessment, 3 (1986).
