"Evaluating the Accuracy of Linked U. S. Census Data: A Household Linking Approach"

The Systematic Linking of Historical Records, University of Guelph, May 10-13, 2017

Ronald Goeken, University of Minnesota
Yu Na Lee, University of Minnesota
Tom Lynch, University of Minnesota
Diana Magnuson†, Bethel University

December 2017

Working Paper No. 2017-1

†Correspondence should be directed to: Diana Magnuson University of Minnesota, 50 Willey Hall, 225 19th Ave S., Minneapolis, MN 55455 e-mail: [email protected], phone: 612-624-5818, fax: 612-626-8375



Introduction

Despite the proliferation of published studies using linked decennial census records, there has been little empirical work on the accuracy of the linked data. The primary reason, of course, is that you can never definitively state that two records taken from two distinct censuses represent the same person. Given the absence of unique identifiers (e.g., social security numbers), matching historical census records depends on high similarity between primary linkage variables; e.g., names, age, sex, and place of birth. Potential links are then classified as true or false according to rules or machine learning procedures. Estimating linkage rates is a straightforward exercise, but error rates can only be measured indirectly.

The goal of most historical census linkage projects is to create linked data without using corroborative evidence derived from co-resident kin and migration status, because of bias issues. This is a valid concern, but it is also possible that relying on linkage methods that ignore a fair amount of corroborative evidence comes at a cost. The obvious effect would be to lower linkage rates. A potentially more significant concern would be the effect on error rates. The main issue is that if the true link is unidentifiable (because of under-enumeration or a mismatch or low similarity on key linkage variables), then any link to this record will be false.

Most record linkage projects more or less assume that the inability to find true links due to mismatches or low similarity for key linkage variables is a relatively minor issue. Our strategy for investigating this topic is to use a maximum amount of information to establish a set of verified links. Primarily, we plan on using the presence of common kin and residential stability (i.e., living in the same place) in successive decennial censuses to supplement similarity at the individual level. Although many true links will not have corroborative household or residential information, we find that many can be verified. These verified links will then be used to optimize blocking strategies and to test procedures used to classify potential links generated by individual level classifiers, primarily by constructing linkage and error rates.

This is still our basic mission statement. However, the nineteenth century linking, which is part of a five-year project examining demographic change in the aftermath of the American Civil War, is still in progress. We provide a status report in the last half of the paper, but in the first half we discuss the development of the household linking process.¹

The 1880 Complete-Count Linkage Project (2003-2009)

In 2003 the Minnesota Population Center began work on a project that would eventually link the complete-count database of the 1880 U.S. population census to samples of other 19th and early 20th century U.S. decennial censuses. The original grant asserted that we would establish links at the individual level and only use a set of variables that would minimize linking bias; i.e., names, age, sex, race, and place of birth. We did not use place of residence or information gleaned from co-resident kin because of bias concerns (i.e., that non-migrants and those living with the same kin in both censuses would be overrepresented in the linked population).²

The decision to ignore corroborative evidence (because of bias concerns) ultimately resulted in the choice of a conservative linking strategy. The final linkage rates were modest, but we felt this was necessary in order to achieve (relatively) low false positive rates. Although we did not possess a “truth” sample for verification, indirect evidence indicated we had relatively low false positive rates. For example, if we independently linked two brothers who were co-resident in the 1880 census, they were rarely not also co-resident in 1870 (i.e., sets of brothers came from the same households in both census years). Another example would be consistency in our male-only and couple-only linked samples; if a male from the 1880 census was linked in both of these samples, we rarely had this individual linked to two different records in the 1870 census.³

Both of these diagnostics offer evidence of consistency and indirectly imply precision. They also cherry-pick a bit, in that the selected universe was native-born whites in 1880; it is likely that error rates were higher for African Americans and the foreign-born (specifically the Irish). It is also probable that some demographic sub-groups might be more likely to have consistent information in successive censuses and thus be more likely to be accurately linked (and this would probably apply to married men and children). We were also more confident in our 1870-1880 linked sample compared to linked samples with inter-censal gaps exceeding ten years (i.e., we expect false positive rates to increase as the years between linked censuses increase).

¹ Hacker, J. David. Principal Investigator. "Models of Demographic and Health Changes Following Military Conflict" 1R01HD082120-01. National Institute of Child Health/Human Development.
² Ruggles, Steven. Principal Investigator. "Population Database for the United States in 1880." R01HD39327, NICHD-DBSB.
³ Ronald Goeken, Lap Huynh, T. A. Lynch and Rebecca Vick, “New Methods of Census Record Linking,” Historical Methods: A Journal of Quantitative and Interdisciplinary History, volume 44, issue 1, 2011. Steven Ruggles, “Linking Historical Censuses: A New Approach,” History and Computing, volume 14, March 2002, pp. 213-244.

Another reason we thought we had relatively low false positive rates (at least for married men and sons) was because we spent some time visually evaluating the linked households. Although we linked on the individual basis (primary links), the resulting linked data consists of the primary links along with their co-resident household members from the two specific censuses. If any of the non-primary records appeared to be the same person in the respective censuses, then we established the link based on a set of rules (with these linked records identified as secondary links).⁴

Many of the 1870-1880 primary links do not have co-resident secondary links for obvious reasons; an example would be a 24-year-old son living with his parents in 1870 linked to a 34-year-old household head living with his wife and three-year-old daughter in 1880. But many of the primary links have co-resident secondary links; in the male 1870-1880 linked sample 28 percent of the primary links have no secondary links, 19 percent have one and 53 percent have two or more secondary links. Although we did not do a systematic analysis, it is possible to pick out low-quality primary links, and the top panel in Figure 1 gives an example. Here the primary link is Henry McHugh, age 14 in 1870 and age 25 in 1880. However, no other record in either household appears to be the same person (with the only real possibility being James E, age 3 in 1870 and Edward J., age 12 in 1880). But an example like this is relatively rare in the 1870-1880 male linked sample. Much more common would be the linked records in the second panel. Here the primary link is James Felkins, age 61 in 1870 (linked to James H. Felkin, age 71 in 1880). And three of James’ kin are secondary links and they appear to be correctly linked despite the differences in expected age. In fact, there is also a high probability that Martha, age 53 in 1870, is the correct link to Matilda, age 67 in 1880.

⁴ See https://usa.ipums.org/usa/linked_data_samples.shtml.


Whether Martha is actually Matilda illustrates a basic dilemma with establishing some of the secondary links; they could be the same person, but perhaps (or probably) are not (it is definitely possible that James re-married to Matilda at some point between 1870 and 1880). But despite a somewhat conservative standard for establishing secondary links in cases of ambiguity, the secondary links had lower levels of similarity compared to our primary links. For example, in the 1870-1880 male file, less than 1 percent of the primary links have an expected age difference exceeding one year of age. For secondary links, over 20 percent have an expected age difference of two years of age or more.

The higher precision for our primary links resulted from our conservative linkage strategy. To use a simplified example, if we had two potential links for a given record, with one potential link being an exact match on all linkage variables and the other being an exact match on all variables with the exception of an expected age difference of four years, we would reject both potential links because of ambiguity (yes, the potential link with an exact age match would have a higher probability of being the true link, but we took a conservative linking approach). In addition, if our only potential link was an exact match except for an expected age difference of four years, we would reject because of low precision. In other words, we had a two-threshold approach, with the higher threshold determining eligibility to be a primary link, and the lower threshold identifying the area of ambiguity; a link was defined as one-and-only-one potential link above the higher threshold, and no other potential links above the lower threshold. This produced fairly accurate results, but also meant that our primary links were not representative of all true links. This finding, along with the understanding that many primary links can be verified through the presence of consistent co-resident kin in both census years, was an important insight, but we really did not appreciate it until we were finished with the 1880 complete-count linkage project.
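The two-threshold rule described above can be sketched in a few lines. This is only an illustration: the paper does not give the scoring function or the threshold values, so the composite scores on a 0-1 scale, the cut-offs of 0.9 and 0.7, and the function name `accept_primary_link` are all assumptions made for the example.

```python
from typing import Optional, Sequence


def accept_primary_link(scores: Sequence[float],
                        upper: float = 0.9,
                        lower: float = 0.7) -> Optional[int]:
    """Return the index of the accepted candidate, or None if the record stays unlinked.

    scores holds one hypothetical composite similarity score per potential link.
    A link is accepted only when exactly one candidate reaches the upper threshold
    and no other candidate reaches the lower threshold (the zone of ambiguity).
    """
    at_or_above_upper = [i for i, s in enumerate(scores) if s >= upper]
    at_or_above_lower = [i for i, s in enumerate(scores) if s >= lower]
    if len(at_or_above_upper) == 1 and len(at_or_above_lower) == 1:
        return at_or_above_upper[0]
    return None


# Exact match plus a near match (e.g., age off by four years): rejected as ambiguous.
print(accept_primary_link([1.0, 0.8]))
# Only a near match: rejected for low precision.
print(accept_primary_link([0.8]))
# Exact match with no close competitor: accepted.
print(accept_primary_link([1.0, 0.5]))
```

The design makes the trade-off in the text visible: every rejection in the first two cases lowers the linkage rate, and the surviving links are concentrated among records with unusually clean and distinctive information.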

Linking Slave-Owners to the 1850 Complete-Count Population Database

Our next linkage project involved the complete-count database of the 1850 U.S. Census, which was a collaboration with the Church of Jesus Christ of Latter-Day Saints (LDS).⁵ In addition to the population records, LDS had also entered the 1850 slave schedules. The slave census has the slave owner names and we wanted to link the slave owners (and their slaves) to the slave owner’s population record. The population and slave enumerations were done simultaneously, so slave owners in the slave schedules and the population schedules should be (roughly) in the same order in their respective databases. However, it became apparent that some slave schedule pages were microfilmed out of their original order (and there are no page numbers or enumerator sequence numbers to verify the sort; the pages have an enumeration date field, but this information was often missing and was not considered to be especially reliable). The forms do not have information for slave owner age, birthplace or sex (and about 20 percent of slave owners in 1850 were female). The only owner-related information on the slave schedules is slave owner name, but the forms have legibility (and transcription) issues and given name often consists of a single initial.

Here we were not concerned about bias in linkage methods; the goal was to accurately link all of the slave owners to their respective population records. The basic rule was that slave owners and their population record would usually follow approximately the same sequence in both schedules (with some exceptions due to absentee slave owners). But we had to identify mini-sequences within counties when the slave schedule pages were out of order. To do this we blocked by county of residence and restricted potential links to records age 17 and older in the population data, and wrote out potential links that exceeded a preset threshold for given and surname similarity.

⁵ Alexander, Joseph Trent. Principal Investigator. "Baseline Microdata for Analysis of U.S. Demographic Change." PRF601864. National Institute of Child Health/Human Development.

The creation of slave owner sequences relied on identifying clusters of potential links (i.e., a high proportion of slave owners from a slave page or range of slave pages that had potential links to a given page or range of pages in the population data). Figure 2 shows an 1850 slave schedule page; a slave page has 84 lines for individual slaves, and this page has 14 slave holdings (i.e., 14 slave owners). The population schedules have 42 lines per page in 1850 and Washington County, Missouri had a free population of 7,736 in 1850; thus Washington County, Missouri had approximately 190 pages of population data in 1850. Again, the only information used to establish the link is name (i.e., we do not have age, birthplace or sex for the slave owners). The basic concept was that the 14 slave owners would have random potential links dispersed over the entire county (pretty much anywhere on pages 1 through 190 in the population data for this example). But typically we could identify a cluster of potential links on a given page or range of pages in the population data; probably not all 14, but we would see clusters, which would indicate that these potential links were the true link (even if another potential link elsewhere in the county had greater name similarity; in other words, sequence order was often a better predictor of the true link than name similarity).
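One way to picture the cluster search is as a sliding window over population pages that counts how many distinct slave owners from one slave schedule page have a potential link inside the window. The sketch below is an assumption-laden illustration, not the procedure actually used: the function name `densest_page_window`, the three-page window, and the owner labels and page numbers are all hypothetical.

```python
from collections import defaultdict


def densest_page_window(potential_links, window=3):
    """Find the run of consecutive population pages linked to the most owners.

    potential_links: iterable of (owner_id, population_page) pairs, all taken
    from a single slave schedule page (or small range of pages).
    Returns (first_page_of_window, set_of_owner_ids_in_window).
    """
    owners_by_page = defaultdict(set)
    for owner, page in potential_links:
        owners_by_page[page].add(owner)

    best_start, best_owners = None, set()
    for start in sorted(owners_by_page):
        owners = set()
        for page in range(start, start + window):
            owners |= owners_by_page.get(page, set())
        if len(owners) > len(best_owners):
            best_start, best_owners = start, owners
    return best_start, best_owners


# Hypothetical potential links: owners A, B and C cluster on pages 12-13,
# while their other name matches are scattered across the county.
links = [("A", 12), ("A", 150), ("B", 13), ("B", 40), ("C", 12), ("D", 97)]
start, owners = densest_page_window(links)
print(start, sorted(owners))
```

Within the winning window, the candidate on those pages would be preferred for each owner even when a better name match exists elsewhere in the county, which mirrors the observation that sequence order often predicted the true link better than name similarity.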

We eventually began to understand that we could apply the slave owner sequencing logic to linking the population records taken from two distinct censuses on the household basis. Basically, a household is a subset of a page of population data. And household members are similar to a group of slave owners on a given page of slave data. The analogy breaks down a bit when dealing with individuals enumerated ten years apart. However, as mentioned above, under certain circumstances we would expect some co-residential stability. Basically, if we find certain combinations of nuclear kin age ten and older co-residing in a given census year, there is a very high probability they were also co-residing ten years earlier. For example, the expectation is that a household head, spouse and two teen-aged sons in the 1880 census will also have been enumerated together in the same household in the 1870 census. At the individual level each of the four records could have multiple potential links to the 1870 census, but the true link would be identifiable because it would be the household combination that also had potential links for other members of the household. Again, the correct household might not have potential links to all four, but three out of four probably would be enough to establish and confirm the link.

Household Linking the Two Enumerations of St. Louis in 1880

One issue with this approach is the large number of individual potential links that need to be generated in order to establish the household links. Most of our potential links will be the only link between specific households in two different censuses (and will not be a true link), but we have no way of knowing this until we generate and process all of the potential links. And working with the complete-count tabulations would require improvements in our processing speed.

We also had to develop an actual process, which evolved during work we did linking the two enumerations of St. Louis in 1880. The first enumeration occurred in June and, because of allegations of an undercount, the Census Office authorized a second enumeration in November of the same year. This was not the first time that an American city would be re-enumerated, nor would it be the last.⁶ But St. Louis in 1880 appears to be unique in that the second enumeration was an attempt at a complete re-enactment; the same enumeration sheets were used in both enumerations and enumerators were expected to complete all of the census questions.⁷ Both enumerations also used the same June 1 reference date. The enumerator instructions for the November recount state that “enumerators will not ask the people of their district whether they have changed their residence since June 1, 1880, but they must ask, ‘Were you residents of St. Louis on the 1st of June?’ or, ‘Was St. Louis your home on the 1st of June, 1880?’ … enumerators will make no inquiries as to removals from one family to another, and from one district to another since June 1 (as suggested in my circular); but they must be very particular to ask, ‘Has any member of this family or household left the city since June 1, 1880?’ and ‘Has any person or family moved from the city from this neighborhood since June 1, 1880?’”⁸

The use of the June 1 reference date for the November enumeration raises a number of issues regarding the accuracy of the results. The enumeration of individuals who were present on June 1 but had subsequently left the city would depend on relatives, neighbors or landlords reporting this information to enumerators as well as giving them information on the migrants’ individual characteristics. Enumerators were always dealing with these issues, and getting fairly accurate information on absentee residents would not be an insurmountable difficulty if the respondent had some familiarity with the absentees. But the five-month gap between the reference date and the actual enumeration would make it difficult to get an exact count and precise information on relatively transient population sub-groups: extended kin and unrelated individuals in general, and those residing in hotels and larger rooming and lodging establishments more specifically. But this should not affect our ability to link the data. In contrast to records taken from two separate decennial censuses, the two enumerations of St. Louis constitute a relatively closed universe; we expect to find the same individuals living with each other. In addition, we have street addresses for both enumerations. Although some individuals would relocate (within the city) between the two enumerations, the addresses would prove useful in the linking process. The use of corroborative evidence in the form of co-resident kin and street address undoubtedly produces biased linkage results. But this issue is not important here because our goal is to link, to the extent possible, all of the records.

⁶ Francis A. Walker, A Compendium of the Ninth Census (Washington, D.C.: GPO, 1870), pp. xx-xxi.
⁷ For example, New York City and Philadelphia had recounts in 1870. In both cases enumerators were only expected to fill in a subset of the questions on the original enumerator sheets.
⁸ “The Census: Revised Instructions Issued to the Enumerators--One District Already Finished,” St. Louis Post Dispatch, November 9, 1880.

Our linkage approach consists of initially establishing potential links at the individual level. Names are cleaned (i.e., non-alpha characters are removed) and parsed (i.e., the given name ‘Mary E’ becomes name1=‘Mary’ and name2=‘E’). Records are blocked by sex and similarity scores based on the Jaro-Winkler algorithm are calculated for given name and surname.⁹ Record pairs having a surname similarity score of at least 0.9, a given name similarity score of at least 0.7, and an absolute age difference of less than five years are selected as potential links.
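The selection of potential links described above can be sketched as follows. The thresholds (0.9, 0.7, fewer than five years) come from the text; everything else is an assumption for illustration, including the record layout (dictionaries with `given`, `surname`, `sex` and `age` keys), the function names, and the textbook Jaro-Winkler variant (prefix scale 0.1, prefix capped at four characters). Implementations differ in their details, so scores from this sketch need not match the values reported in the paper.

```python
import re


def clean_and_parse(given_name):
    """Drop non-alphabetic characters, then split into first name and the rest."""
    parts = re.sub(r"[^A-Za-z ]", "", given_name).upper().split()
    return (parts[0] if parts else "", " ".join(parts[1:]))


def clean_surname(surname):
    return re.sub(r"[^A-Za-z]", "", surname).upper()


def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    k = out_of_order = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


def is_potential_link(a, b, surname_min=0.9, given_min=0.7, max_age_gap=5):
    """Block on sex, then apply the name and age thresholds from the text."""
    if a["sex"] != b["sex"] or abs(a["age"] - b["age"]) >= max_age_gap:
        return False
    given_a, given_b = clean_and_parse(a["given"])[0], clean_and_parse(b["given"])[0]
    surname_score = jaro_winkler(clean_surname(a["surname"]), clean_surname(b["surname"]))
    return surname_score >= surname_min and jaro_winkler(given_a, given_b) >= given_min


first = {"given": "John", "surname": "O'Donnell", "sex": "M", "age": 43}
second = {"given": "John", "surname": "O'Donnell", "sex": "M", "age": 46}
print(clean_and_parse("Mary E."), is_potential_link(first, second))
```

Note that only the first parsed name is compared here; how the second name part (`name2`) entered the original scoring is not stated in the text, so it is left unused.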

We did not standardize given names, nor did we use birthplace or race as a blocking factor. Some name standards are fairly obvious, but we decided to empirically determine the appropriate standards based on our initial links rather than impose standards based on assumptions. We hoped to use street address to facilitate the linking, but our initial attempts to link on the basis of matching street and house number information produced relatively few quality matches. In addition, we have enumeration district information, but there were 168 enumeration districts in the first enumeration compared to 450 in the second. For that reason we initially did not use district information to link records.

⁹ Peter Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer, 2012. http://link.springer.com/book/10.1007%2F978-3-642-31164-2.


Although we find far fewer exact or near duplicates in St. Louis than we would if we were trying to link the entire country, we nonetheless encounter a fair amount of ambiguity when looking at potential links on the individual level. Much of this ambiguity is eliminated if we take into account characteristics of co-resident family members. For example, in the first enumeration we have a ‘John O’Donnell’ who was 43 years old. Restricting the potential links to exact name matches and a maximum age difference of four years, we have three men named John O’Donnell in the second enumeration with ages of 45, 45 and 46 (see Figure 3). We know that the 43-year-old in the first enumeration is actually the 46-year-old in the second enumeration after we take into account information from other household members.

Rather than creating variables for each individual pertaining to information gleaned from co-resident kin (e.g., father’s name, father’s age, mother’s name, mother’s age, etc.) we create potential links for each individual using the simple method outlined above. Then we sum the number of potential links between specific households in the two enumerations. Using the O’Donnell example, each household member in the first enumeration has numerous links to individual records in the second enumeration. For most of these potential links, however, only one of the household members has a link to a specific household in the second enumeration. Despite the inconsistent age for John O’Donnell in the two enumerations (age 43 and 46), we know that this is the correct link after determining that his spouse and children also have potential links between these two households.
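Summing potential links by household pair, as described above, amounts to a grouped count. The sketch below assumes hypothetical person and household identifiers and a simple lookup from person to household in each enumeration; none of these names come from the paper.

```python
from collections import Counter


def household_link_counts(potential_links, household_first, household_second):
    """Count individual potential links between each pair of households.

    potential_links: iterable of (person_id_first, person_id_second) pairs.
    household_first / household_second: dicts mapping person id to household id
    in the first and second enumeration respectively.
    """
    counts = Counter()
    for person_a, person_b in set(potential_links):  # ignore duplicate pairs
        counts[(household_first[person_a], household_second[person_b])] += 1
    return counts


# Hypothetical version of the O'Donnell case: the head (p1) has three name
# matches in the second enumeration, but only household K1 also contains
# potential links for his spouse (p2) and child (p3).
household_first = {"p1": "H1", "p2": "H1", "p3": "H1"}
household_second = {"q1": "K1", "q2": "K1", "q3": "K1", "q4": "K2", "q5": "K3"}
links = [("p1", "q1"), ("p1", "q4"), ("p1", "q5"), ("p2", "q2"), ("p3", "q3")]

counts = household_link_counts(links, household_first, household_second)
print(counts.most_common())
```

The household pair with several supporting links stands out immediately, while the two competing candidates for the head are each supported by a single link.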

This process also allows us to establish links even if some of the household members do not have potential links in our initial linking pass (see Figure 4). In this household the first and third members of the two households were not in our initial potential links file because of low given name similarity (Autonia-Anton has a Jaro-Winkler similarity score of 0.699, which is below the 0.7 threshold) and excessive enumerated age difference (Annie was 19 years old in the first enumeration and 14 years old in the second enumeration). However, after determining that there are four other links between these households, we can also establish links for the records that were not initially linked on the individual basis.


We established links between households based on the following rules. First, if we have four or more potential links between specific households in the two enumerations, and each of the households with four or more potential links did not have two or more links to any other household (in the other enumeration), then we flagged it as a linked household. Second, we also accepted households with three potential links, if neither of these households had two links to any other household. Finally, we reviewed our work by visually inspecting the households with the lowest composite similarity for names and age or linked households with a majority of household members unlinked at the individual level. Using this approach we were able to link about one third of the first enumeration households; 21,214 out of 63,325 households and 99,147 out of 276,683 related individuals.
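The first two rules can be expressed as a filter over household-pair link counts. This is a minimal sketch under stated assumptions: the input is a dictionary of counts keyed by (first-enumeration household, second-enumeration household), the function name and parameters are hypothetical, and the two rules are folded into one test (at least three links, and no competing household pair with two or more links), since both rules share the same competing-link condition. The visual review step is not modeled.

```python
def linked_households(counts, min_links=3, max_competing=1):
    """Return (household_first, household_second, n_links) for accepted pairs.

    A pair is accepted when it has at least min_links potential links and
    neither household has more than max_competing links to any other household.
    """
    accepted = []
    for (h1, h2), n in counts.items():
        if n < min_links:
            continue
        # Competing pairs share exactly one household with the pair under review.
        competing = max(
            (m for (a, b), m in counts.items() if (a == h1) != (b == h2)),
            default=0,
        )
        if competing <= max_competing:
            accepted.append((h1, h2, n))
    return accepted


# Hypothetical counts: A-X is clean; B-Z is blocked because B also has
# two links to household W.
counts = {("A", "X"): 4, ("A", "Y"): 1, ("B", "Z"): 3, ("B", "W"): 2, ("C", "X"): 1}
print(linked_households(counts))
```

Raising `min_links` to 4 reproduces the stricter first rule on its own, which may be useful if the four-link and three-link households are to be flagged as separate quality tiers.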

This method only works on related individuals and will not link smaller households. However, after establishing high quality linked households, we set them aside and made additional passes through the data (discussed below). We also used the visual review process to assess why many households remained unlinked. A primary reason was lower levels of surname similarity for unlinked households, while some smaller households were unlinked because of the use of diminutives or abbreviations for given names in one enumeration or the other. We also began to explore ways to use place of residence to either verify or link households with relatively low similarity. For example, some linked households had street agreement, but their house number was off slightly (e.g., 2402 Market St. in one enumeration versus 2404 Market St. in the other). In addition, some of our initial set of household links had house number agreement, but the street name disagreed. An examination of the linked households identified many street name corrections and we were also able to construct an enumeration district translation table between the two enumerations. Although many linked households had street address disagreement, almost all of the linked households that had identical address information resided in one of a set of contiguously numbered districts in the second enumeration corresponding to a single district in the first enumeration. The correction to addresses and the use of the enumeration district equivalents allowed us to link households that had been difficult to link because of their small size or because of low surname similarity.
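An enumeration district translation table of the kind described above could be derived from already-linked households by taking, for each second-enumeration district, the first-enumeration district it most often pairs with. The sketch below is a guess at one workable construction, not the authors' method; the function name and the district numbers are hypothetical.

```python
from collections import Counter, defaultdict


def district_crosswalk(linked_district_pairs):
    """Map each second-enumeration district to its most common first-enumeration district.

    linked_district_pairs: iterable of (first_district, second_district), one
    pair per linked household (ideally households with identical addresses).
    """
    tallies = defaultdict(Counter)
    for first_district, second_district in linked_district_pairs:
        tallies[second_district][first_district] += 1
    return {
        second: tally.most_common(1)[0][0]
        for second, tally in tallies.items()
    }


# Hypothetical linked households: second-enumeration districts 31 and 32
# both fall inside first-enumeration district 12.
pairs = [(12, 31), (12, 31), (12, 32), (13, 32), (12, 32), (13, 33)]
print(district_crosswalk(pairs))
```

Because there were far more districts in the second enumeration than in the first, a many-to-one mapping in this direction is the natural shape for the table; a household pair can then be treated as residing in the same district equivalent when the crosswalk maps its second-enumeration district onto its first-enumeration district.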


A second group of potential links was generated using the same thresholds used in the initial pass, except we lowered the surname threshold to a Jaro-Winkler score of 0.7 and applied some empirically-derived name standards to the given names. We then generated a second batch of household links using rules based on the number of potential links between specific households in the two enumerations. After identifying higher quality household links, the household linking rules allowed less precision if there was some evidence of residential persistence; either identical address information or similar address and residing in the same enumeration district equivalent.

Figure 5 shows two examples of linked households with surname similarity below our initial threshold of 0.9. The surname combination of Burgherdt-Burkhart generates a Jaro-Winkler score of 0.86, a level generally sufficient to establish a link if other linking variables also had relatively high similarity. And, after looking at the entire household, it is obvious that these households were correctly linked. The second household in Figure 5 is also linked, but has a surname similarity of 0.67. Here we suspect that an individual link with the surname combination of Fitzgerald-Vetzgura would be rejected by most classifiers. After looking at the household composition, however, we conclude that these are the same people. Any doubts are alleviated by looking at the household head’s occupation (“stone mason” in both enumerations) and street address (the household was enumerated at 2405 Division Street in both enumerations).

Occupational information was never explicitly used to establish links. But we began to use street address and enumeration district information to link households, and this was useful in establishing links between smaller households (especially one- and two-person households). The bottom two linked households in Figure 5 give a couple of examples. The first household has a surname similarity of 0.63, and linking is further complicated by the head’s given name (Frank vs. F. H. in the two enumerations). However, these households were enumerated at the same address, and are very likely the same people (with additional corroboration provided by the head having the occupation of ‘Retail Grocer’ in both enumerations). The second of these linked households has higher surname similarity (0.85) but linking is complicated by the head’s given name (Caroline vs. Catherine in the two enumerations). Although they do not have identical street information (4th Street vs. 5th Street) they do have identical house number information and were enumerated in the same enumeration district equivalents, which was enough of a tell to establish the link (we also have similarity in the head’s occupation: Caroline’s listed occupation was “Keep Millinery Store” and Catherine was a “Milliner”).

The rules-based system, with its shifting thresholds and manual intervention, undoubtedly introduces bias. However, we are primarily interested in maximizing the number of links and making sure that they are correct links. Although we have not finished our work linking the related individuals, Table 1 shows that we have established links for 78 percent of the households in the first enumeration and 74 percent of the households in the second enumeration, which corresponds to 80 percent of the related individuals in the first enumeration and 76 percent of the related individuals in the second. As expected, given our household linking approach, we have more success linking households that contain more related individuals. Some of the currently unlinked households cannot be linked because the household is missing from one enumeration or the other. We anticipate, however, increasing our linkage rate through trial and error and the process of elimination. Some of the unlinked households have surname similarity below the thresholds used thus far, and we continue to modify our rules to link the smaller households. In addition, in the future we will attempt to link the unrelated population, although we suspect that many of the boarders and lodgers will be unlinkable due to the absence of corroborative information supplied by co-resident kin.

Table 2a gives the linked population’s distribution by surname similarity measures. At higher levels of similarity we would typically assume a potential link with that combination of surnames would be a true link given sufficient similarity for other linkage variables (e.g., given name, age, birthplace, and sex). This assumption begins to break down as we see less similarity in the surname combinations. Figure 6 gives examples of surname combinations from the St. Louis linked records along with the Jaro-Winkler score, phonetic codes and matched letter metrics. There is no absolute rule for deciding at what point the similarity between sets of linked surnames transitions from “plausible” to “maybe” to “doubtful.” Based on Figure 6, the transition from “maybe” to “doubtful” probably begins around 0.8 Jaro-Winkler similarity. And this means over 11 percent of our links would be treated with a fairly high level of scepticism without the corroborative information from other co-resident household members (or consistent place of residence information).

In addition to Jaro-Winkler score, Table 2a gives surname matching rates for two phonetic code algorithms, NYSIIS and double metaphone. We also construct measures indicating whether the first letter, the first two letters, and the first three letters of a surname match for our linked records. Almost half of the linked records are perfect matches, and all of these would also be considered matches using the phonetic codes and matching letters techniques. However, for linked records with non-exact matches for surname but a Jaro-Winkler score greater than 0.95, 66 percent would be a match using NYSIIS and 74 percent would be a match using double metaphone.¹⁰ Overall, 69 percent of the surname combinations of the linked population have a NYSIIS match compared to 73 percent for double metaphone. Over 93 percent of the linked surname combinations match on the first letter, with 80 and 71 percent matching on the first two and first three letters.
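The matched-letter measures are simple to compute once linked surname pairs are in hand. The sketch below shows one plausible definition (case-insensitive agreement on the first one, two and three letters); the function name is hypothetical, and the two surname pairs are taken from the Figure 5 examples discussed earlier rather than from Table 2a.

```python
def prefix_match_rates(surname_pairs, lengths=(1, 2, 3)):
    """Share of linked surname pairs that agree on their first n letters."""
    pairs = list(surname_pairs)
    if not pairs:
        raise ValueError("at least one surname pair is required")
    rates = {}
    for n in lengths:
        hits = sum(a[:n].upper() == b[:n].upper() for a, b in pairs)
        rates[n] = hits / len(pairs)
    return rates


pairs = [("Burgherdt", "Burkhart"), ("Fitzgerald", "Vetzgura")]
print(prefix_match_rates(pairs))
```

A blocking strategy that requires first-letter agreement would keep the Burgherdt-Burkhart pair but could never generate Fitzgerald-Vetzgura as a potential link, which is the kind of loss these rates are meant to quantify.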

The second panel in Table 2b shows the distribution by Jaro-Winkler score for given names. With given names, we are more concerned with standardizing abbreviations and diminutives than with whether we can match dissimilar combinations with phonetic codes. The share of linked records that have perfect given name similarity is 53 percent, with another 2.7 percent having a single initial matching the first letter of a full given name. This leaves over 44 percent of the links with less than perfect similarity. However, we constructed name standards after examining non-identical given name combinations in our linked data. In addition to the 53.8 percent of links with an exact name score, another 25 percent receive an exact score after standardization.

¹⁰ https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System; https://en.wikipedia.org/wiki/Metaphone; http://www.b-eye-network.com/view/1596; https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance


The overall imprecision in given names is also driven by the fact that some of our linked records have distinctly different given names in the two enumerations. Figure 7 gives a few examples of these. The first set shows linked records that would have a given name match if we compared first names to middle names. The second set consists of examples where the given name matches a middle initial for the linked record (e.g., the “N” in “Bayard N” probably stands for “Nelson”). In contrast, the third set of linked records have little or no similarity in their given names, nor do they have middle initials that match a given name. Possible explanations for given name inconsistency would include changing personal preferences, respondent bias, enumerator error, and transcription error.

Table 3 gives the distribution of age precision for our linked records. If enumerators were giving a respondent’s age as of the November enumeration (rather than age on June 1st) then being a year older in the second enumeration would be considered a good or perfect match. Being a year off in the other direction would also be considered a good match if we were linking across different decennial censuses. But that would still leave over 16 percent of our linked records with an age difference of two or more years. Some respondents may not have known their true age, and their response to enumerators may have been somewhat random. Some of the imprecision is caused by respondent bias, in that co-resident kin or even neighbors might have been supplying information to a given enumerator. Transcription error would also contribute here. Regardless of the source of the error, we suspect that age differences in true links found in two different 19th century U.S. censuses would have similar (or possibly higher) rates of imprecision.¹¹

The table also gives the somewhat surprisingly high levels of sex errors in our linked data, where almost one percent of the linked records have a sex mismatch. Although we did minimal blocking in linking the two enumerations, we did block by sex. After establishing links between households, we often have remaining unlinked related household members in the household in both enumerations. We automate a forcing procedure to link these records (if possible). We evaluated the results through clerical review, and in the process found many households with a single unlinked record in both enumerations that was very similar with the exception of a sex conflict. These records tended to be younger individuals, and often had given names that were gendered equivalents (e.g., Josephine to Joseph, Augusta to August, and Julia to Julius). It is possible that in the absence of a declaration of gender on the part of the respondent, infants and small children would not have been easily identified by the enumerator as male or female. This also reflects the oral nature of the census; enumerators recorded what they thought they had heard.

¹¹ Peter R. Knights, “Accuracy of Age Reporting in the Manuscript Federal Census of 1850 and 1860,” Historical Methods Newsletter, Vol. 4, Issue 3, 1971. Ronald Goeken, Lap Huynh, T. A. Lynch and Rebecca Vick, “New Methods of Census Record Linking,” Historical Methods: A Journal of Quantitative and Interdisciplinary History, Vol. 44, Issue 1, January 2011.

Table 3 gives place of birth and race consistency for the linked records. The reporting of the race variable was relatively consistent, especially after taking into account inconsistency in the black and mulatto categories. Only 0.2 percent of the linked records go from white to black/mulatto (or vice versa). In contrast, over 8 percent of our linked records have mismatched birthplaces and over 18 percent have mismatches on parental birthplaces. The disagreement rate goes down quite a bit if we combine all U.S. birthplaces into a single category and do the same for the foreign born. But even using this conservative measure, 1.3 percent of our linked records have a U.S. birthplace in the first enumeration and a foreign birthplace in the second enumeration, and 1.2 percent have a foreign birthplace in the first enumeration and a U.S. birthplace in the second enumeration.

Our evaluation of linkage variable precision for the St. Louis data is preliminary, since we have not finished linking the two enumerations. The overall impression at this point is that a significant number of the linked records would not be linkable at the individual level because of low similarity. The only way we were able to link some of the households was by using address information along with the assumption that the two enumerations were a relatively closed universe.

Household Linking the Complete-Count 1870 and 1880 U.S. Censuses


We suspended the St. Louis linkage project in late 2016 (although we anticipate finalizing the linking at some point). We initially hoped to use the St. Louis linked data to train individual-level classifiers that we would use to link the various 19th century U.S. censuses. One reason why this might not be a great idea is that the high levels of imprecision found in the St. Louis linked data might not be representative of what we would find in the population of all true links found in the decennial censuses. This is basically an issue of whether or not the two enumerations of St. Louis were of atypically poor quality. We have no way of directly answering this question; we suspect that overall the accuracy (or consistency) found in the 19th century U.S. censuses was less than ideal. The relative lack of precision in the linked St. Louis data could be a worst-case example, but it could also be what we would typically expect in enumerations of large American cities in the 19th century.

Given concerns about using the St. Louis linked data as training data, we decided to apply the household linking process to the complete-count decennial censuses. It was unclear how many households we would be able to link, but we were confident that it would be a sufficient number to train and test individual-level classifiers. We would also be able to construct false positive estimates based on verified links (at least for the proportion of the population that we would link and confirm via the household linking process).

The only real impediment to applying the household linking process to the complete-count tabulations is the relative size of the databases; e.g., the United States had a population of 38 million in 1870 and 50 million in 1880. When we began work on linking the 1870 and 1880 complete count databases last fall it was taking at least a week of processing time to generate a basic potential links file. Earlier this year, however, we made some improvements and are currently able to generate a potential links file comparing 1870 to 1880 in about a day.

We block by sex and place of birth. We write out potential links if the expected age difference is less than or equal to five years and both given name and surname similarity are greater than or equal to 0.8 (Jaro-Winkler). If the given name is an initial (in either year) and it matches the first letter of the given name for a record in the compare year (regardless of whether it is an initial or full name), then given name similarity is set at 0.8 (and is thus eligible to be included in the potential links file). We also apply a relatively short list of given name standards (based on our St. Louis household linked data).12 The outfile consists of 2.4 billion potential links.13
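The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's software: the record layout (dicts with `id`, `given`, `surname`, `age`, `sex`, `bpl`) and all function names are hypothetical, the expected 1880 age is taken to be the 1870 age plus ten, and the Jaro-Winkler function is a plain textbook version with the usual 0.1 prefix scale.

```python
from collections import defaultdict


def jaro(s1, s2):
    """Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            out_of_order += s1[i] != s2[k]
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1):
    """Jaro plus a bonus for a shared prefix of up to four letters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)


def given_similarity(g70, g80):
    """Given-name score; a matching initial in either year is fixed at 0.8."""
    if not g70 or not g80:
        return 0.0
    if (len(g70) == 1 or len(g80) == 1) and g70[0] == g80[0]:
        return 0.8
    return jaro_winkler(g70, g80)


def four_way_given_similarity(raw70, std70, raw80, std80):
    """Maximum over the raw/standardized combinations of the two names."""
    return max(given_similarity(a, b) for a in (raw70, std70) for b in (raw80, std80))


def potential_links(recs70, recs80, max_age_gap=5, threshold=0.8):
    """Yield (id70, id80) pairs that pass the blocking and similarity rules."""
    blocks = defaultdict(list)
    for r in recs70:  # block on sex and place of birth
        blocks[(r["sex"], r["bpl"])].append(r)
    for r80 in recs80:
        for r70 in blocks.get((r80["sex"], r80["bpl"]), []):
            age_gap = abs(r80["age"] - (r70["age"] + 10))
            if (age_gap <= max_age_gap
                    and given_similarity(r70["given"], r80["given"]) >= threshold
                    and jaro_winkler(r70["surname"], r80["surname"]) >= threshold):
                yield r70["id"], r80["id"]
```

Blocking through a dictionary keyed on sex and birthplace keeps comparisons within a block, which is also why very large blocks (for example a populous state of birth) dominate the number of comparisons.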

At this point we are only interested in records that constitute a cluster; basically we want to examine sets of two or more potential links between specific households in 1870 and 1880 (i.e., potential household links). Thus we filter out any potential link that is the sole link between specific 1870 and 1880 households. This reduces the file to 79 million individual potential links and 38 million 1870 and 1880 household combinations. Although the potential links file used a 0.8 surname threshold, we initially only process records that have surname similarity of at least 0.9. This further reduces the file to 48 million individual potential links and 21 million 1870 and 1880 household combinations. Most of the 1880 households that are included in the potential links file have multiple households in 1870 as potential links (e.g., only 10 percent have a potential link to a single household in 1870). At this stage we have ambiguity given that we are using relatively low age and name similarity thresholds, and some birthplace blocks contain a disproportionately large number of records (e.g., New York State, Ireland, Germany). We could attempt to disambiguate conflicting links based on composite household age or given name similarity, but we were fairly confident that applying rules similar to those used in linking the St. Louis enumerations would provide a good first approximation.

12 We also do a four-way given name comparison and take the maximum value (i.e., 1. 70 raw/80 raw; 2. 70 raw/80 std; 3. 70 std/80 raw; 4. 70 std/80 std).

13 We use custom software written in Python to compare records between complete-count datasets. Development of the software considers the performance effects of four main parameters: I/O time (including network communication), compute time, memory consumption, and disk space. Our software keeps data on disk as long as possible, only pulling in data when needed and immediately writing it back out to disk at the conclusion of processing. This strategy requires many more disk reads/writes than an alternative approach that keeps data in memory, but is relatively fault-tolerant, since the data are immediately persisted to long-term storage. With extra preprocessing, use of appropriate system calls, and proper balancing between data chunk size and number of tasks, I/O time is reduced relative to compute time. Random access to the data is enabled by generating an index on the data prior to running comparisons, and processing is amortized across many small tasks, several of which can run concurrently. The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported within this paper. URL: http://www.msi.umn.edu

We work from the perspective of 1880 and calculate the number of individual level links between a specific 1880 household and 1870 households (the minimum will be a potential link to one household in 1870 consisting of 2 individual level links). An 1880 household is considered linked if it has at least 4 individual links to a specific 1870 household and no more than 2 individual links to any other 1870 household. In addition, an 1880 household with 3 individual links to a specific 1870 household and no more than 1 link to any other 1870 household is linked. This initial rule establishes 1,553,420 household links consisting of 6,473,809 individual links.
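The rule just described can be written compactly. The sketch below is illustrative only: it assumes each potential individual link has already been reduced to a pair of household identifiers, and the function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def link_households(individual_links):
    """Apply the rules-based household linking rule.

    individual_links: iterable of (hh1870, hh1880) pairs, one pair per
    potential individual link. Returns {hh1880: hh1870} for the 1880
    households that satisfy the rule.
    """
    counts = defaultdict(Counter)
    for hh70, hh80 in individual_links:
        counts[hh80][hh70] += 1

    linked = {}
    for hh80, per_hh70 in counts.items():
        ranked = per_hh70.most_common()
        best_hh70, best_n = ranked[0]
        # Largest number of links to any competing 1870 household.
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if (best_n >= 4 and runner_up <= 2) or (best_n == 3 and runner_up <= 1):
            linked[hh80] = best_hh70
    return linked
```

Only the best-supported 1870 household needs to be checked for each 1880 household, because any other candidate necessarily has a competitor with at least as many links.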

We have no way of directly measuring our false positive rate. However, we can look for indirect evidence in the form of inconsistency. Since we do not use place of residence information to establish links, we can use the crude migration rate (defined as not living in the same state and county in both censuses) as a proxy for the false positive rate. In other words, we expect to see fairly consistent rates of living in the same state and county in our linked households regardless of non-demographic characteristics. For example, age and gender are likely to have an effect on migration behavior. But overall similarity or name commonness in our linked records should not have a large effect on migration behavior. All things being equal, if a linked household resides in the same state and county in both enumerations, our confidence that this is a true link increases; a linked record that is also a non-migrant is rarely an error. However, migrants are typically a mix of true links and false positives.14
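As a rough illustration of this diagnostic, the share of non-migrants can be tabulated by any linkage metric and compared across groups. The field names (`state70`, `county70`, `state80`, `county80`) and the grouping function below are hypothetical.

```python
from collections import defaultdict


def non_migrant_share(linked_households, key):
    """Share of linked households found in the same state and county in
    both censuses, tabulated by the group returned by key(household)."""
    totals = defaultdict(int)
    stayers = defaultdict(int)
    for hh in linked_households:
        group = key(hh)
        totals[group] += 1
        same_place = (hh["state70"], hh["county70"]) == (hh["state80"], hh["county80"])
        stayers[group] += same_place
    return {group: stayers[group] / totals[group] for group in totals}
```

A markedly lower non-migrant share in one group, where no behavioral reason for more migration is expected, would point to a higher share of false positives in that group.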

Table 4 gives migration status for the first batch of linked households by various linkage metrics. The top panel gives migration rates based on surname similarity. This is a household measure (and we select the first potential link with a nuclear relationship to represent the household). There appears to be a relationship between surname similarity and being a non-migrant, although the range is relatively small. It is possible that migrants are less likely to have their surnames recorded accurately or consistently, but it is also possible that we are more likely to have false positives as surname similarity decreases (and thus higher levels of migration for linked records with lower surname similarity indicate a higher probability of false positives at lower levels of surname similarity).

14 This is a relative rather than an absolute rule. Some American counties have populations greater than the totals for the least populated states.

The second panel in Table 4 gives migration rates for overall record uniqueness. We construct a uniqueness score based on the number of potential links generated by the given potential link (which is dictated by whether a record has a relatively common combination of given name and surname, but also by the overall size of their birthplace block). We take the inverse at the individual level, and calculate the average for the household. For example, if a given record in 1880 has only one potential link to 1870, the individual score = 1/1 (1.0). If a given record has 100 potential links in the 1870 data, then the individual score = 1/100 (0.01). Thus high household scores indicate relative uniqueness. There appears to be a clear relationship between lower household uniqueness scores and migration, although the range is again relatively small. We would not expect different levels of household uniqueness to affect the decision to move between censuses; thus the differential is indicative of higher false positive rates as household uniqueness decreases.
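The uniqueness score reduces to a mean of reciprocals. A minimal sketch, assuming the input is simply the list of potential-link counts for the records that make up one household link (the function name is hypothetical):

```python
def household_uniqueness(potential_link_counts):
    """Average of 1/n over a household's records, where n is the number of
    potential links each record generated in the other census."""
    scores = [1.0 / n for n in potential_link_counts if n > 0]
    return sum(scores) / len(scores) if scores else 0.0
```

For a household with one record that has a single potential link and another with 100, the score is (1.0 + 0.01) / 2 = 0.505.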

The bottom panel gives the migration rates based on how many records constitute the linked household. Here it is possible that the differential does not indicate false positives, but rather indicates that smaller households (and especially if they were younger couples) were in fact more likely to move between censuses. Nonetheless, we anticipate that there are false positives in our initial set of household links, and that Table 4 provides clues about where we would most likely find them: household links based on the minimum number of individual links, and those comprised of relatively common records and lower overall similarity (either overall age or given name and surname similarity).15

15 One possibility explaining the differentials in Table 4 is that migrants are more likely to live in places where the overall enumeration quality is lower; i.e., urban and frontier areas.

Table 5 gives the household linkage rate after the first round of rules-based household linking. We only link 15 percent of all 1880 households, but most 1880 households (52 percent) are not at risk of being linked because they contain fewer than three linkable 1880 records (with linkable defined as having a nuclear relationship to head and being at least 10 years old in 1880). However, we link 32 percent of the eligible households and over 40 percent of the households containing 5 or more linkable records. The table also gives household linkage rates by race and nativity (based on the household head’s race and place of birth), with native-born whites the most likely to be linked under the rules-based approach. We suspect that non-white groups have lower overall precision (and possibly less stable households). The foreign-born might have lower linkage rates because of lower overall precision (especially in the recording of surname information), but the lower linkage rate could also be caused by the fact that some of them were not present in the United States in 1870 (and we do not have year of immigration information in the 19th century censuses).

Overall, the 32 percent household linkage rate is promising. And, based on our experience linking the St. Louis data, many true household links will be found if we lower the surname similarity threshold (for the initial pass we set the threshold at 0.9). But we also felt that many true household links were in our current potential links universe (i.e., at the 0.9 surname level) but remained unlinked because of ambiguity (multiple conflicting potential household links) or because of low numbers of linkable 1880 members (three or fewer potential links in a potential household link). And it is preferable to establish these links before we try to link households with lower surname similarity.

Linking Households Based on Evidence of Common Neighbors

Eventually we will develop measures to identify the most similar household in cases of ambiguity, but a quick and dirty approach would be to take the non-migrant household if there are multiple potential household links. However, while crude non-migration works well as a diagnostic tool, it is not always a precise linking variable. Large American cities are typically located in a single county. In addition, for some small states a high proportion of all individuals born in the state will reside in the largest city in that state (e.g., Boston, Massachusetts; Providence, Rhode Island; Baltimore, Maryland). Also some ethnicities tend to cluster in large cities. For example, linking an Irish household living in Boston in both 1870 and 1880 does not provide definitive evidence that this is the true link.

Although we plan on continuing to experiment with the following approach, we currently construct a measure of potential household neighbors. We have 38 million potential household combinations in our initial potential links file and over 99 percent are the only potential household link for the given combination of 1870 census page and 1880 census page (there are 40 lines per page in 1870 and 50 lines per page in 1880). All things being equal, the presence of two or more potential household links on the same page combinations would increase our confidence that these potential household links are the true links. But many neighbors will not show up on exactly the same census page combination in the two enumerations. Typically households enumerated ten years apart would not be enumerated in the exact sequence even if they had not physically relocated; direct evidence of neighbors depends somewhat on whether or not the enumerator took the same route in two different enumerations. But many non-movers should have common neighbors in the enumerations regardless of whether or not they show up in the same exact sequence.

Currently we calculate the number of potential household links for specific grids consisting of ranges of images in the 1870 and 1880 data. The grid is calculated from the perspective of specific potential household links (thus each combination of 1870 page and 1880 page will have its own unique grid). For example, a potential household link is located on page x in 1870 and page y in 1880. The grid (for this potential household) is defined as x plus/minus 10 (pages) in 1870 and y plus/minus 8 (pages) in 1880 (there are 40 lines per page in 1870 and 50 lines per page in 1880; thus the grid, based on this definition, consists of a maximum of 840 records in 1870 and 850 records in 1880). And we want to know how many other potential household links are present in a grid.

Table 6 gives the distribution of the potential household links by the number of potential household neighbors (PHHN) in their respective grid. Approximately 59 percent of the time the specific potential household link will be the only potential household link in the grid (i.e., PHHN = 1). Some of these could be true links (if the household physically moved between censuses and thus does not have any common neighbors), but we suspect that most are false links. The right side of the table gives the PHHN distribution for the rules-based links. The table also gives the relationship between PHHN and migration status for our first batch of linked households; over 70 percent of our initial household links are migrants if they are the only potential link in their grid. As grid count increases, the household links are increasingly non-migrants.16

Figure 8 shows the potential household links contained in a single grid. The reference household is highlighted (the “Turks”), and this is the only potential household link on the specific combination of 1870 page and 1880 page. Their grid is defined as 1870 page +/- 10 pages and 1880 page +/- 8 pages, and there are 12 other potential household links in this grid; thus the PHHN for the reference potential household link (the Turks) is 13 (other potential households in the figure will have different values for PHHN because the grid moves as we calculate PHHN for other combinations of pages). The figure does not contain page information, but it does contain household serial number information. The serials for both years are set to zero for the reference household in the example (the Turks), with the values for other potential households equal to the difference between their actual household serial and the actual household serial for the reference household. For example, the Kime household has a serial80diff = 2, meaning there was one household located between the Turk household and the Kime household in 1880. For 1870 the value is -10, meaning there were nine households between the Turk household and the Kime household in 1870.

A high value for PHHN typically indicates the true household link, but we initially expected some potential household links to have relatively high values but still be a false link. Thus we combine the PHHN with the household uniqueness score discussed earlier. The average uniqueness score for a household ranges from 0 to 1.0, which we convert to an integer (i.e., 1 to 100). Combo score is the product of PHHN and the household uniqueness score. Using Figure 8 as an example, the range of PHHN is 10 to 27, the range of uniqueness score is 2 to 40, and the range for combo score is 26 to 1040.

16 And the small percentage of potential household links that have high PHHN and are also migrants are apparently residents of counties that experienced boundary changes between 1870 and 1880.

Without much experimentation we decided to create another batch of linked households based on the combo score. We also decided to include smaller households (i.e., potential households with only two potential links) in the eligible universe. Thus any 1880 household not linked in the first pass (rule-based) that has at least two potential links is eligible. If the potential household link has the maximum number of individual potential links for that household, and the potential household has a combo score of at least 100, we consider it linked. Figure 8 shows how this rule affects the households in this grid. Five of nine households that were initially unlinked are now linked. In addition, it seems that the current combo score threshold is too conservative; all of the remaining unlinked households appear to be true household links.
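A minimal sketch of the combo score and the second-pass rule follows. Several details are assumptions made for illustration and are not stated in the text: the uniqueness score is rescaled by rounding to the nearest integer with a floor of 1, the candidate layout (`hh80`, `hh70`, `n_links`, `phhn`, `uniqueness`) is hypothetical, and an 1880 household with more than one qualifying 1870 candidate is left unlinked.

```python
def combo_score(phhn, uniqueness):
    """PHHN times the household uniqueness score rescaled to 1-100."""
    scaled = max(1, round(uniqueness * 100))  # rounding rule is an assumption
    return phhn * scaled


def second_pass_links(candidates, min_combo=100):
    """Link 1880 households left unlinked by the rules-based pass.

    candidates: dicts with keys hh80, hh70, n_links, phhn, uniqueness,
    one per potential household link. Returns {hh80: hh70}.
    """
    most_links = {}
    for c in candidates:
        most_links[c["hh80"]] = max(most_links.get(c["hh80"], 0), c["n_links"])

    qualifying = {}
    for c in candidates:
        if (c["n_links"] >= 2
                and c["n_links"] == most_links[c["hh80"]]
                and combo_score(c["phhn"], c["uniqueness"]) >= min_combo):
            qualifying.setdefault(c["hh80"], []).append(c["hh70"])

    # Ties between qualifying candidates are left unresolved (an assumption).
    return {hh80: hh70s[0] for hh80, hh70s in qualifying.items() if len(hh70s) == 1}
```

With a PHHN of 13 and an average uniqueness of 0.40, for example, the combo score is 13 x 40 = 520, well above the threshold of 100.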

Again, this first pass only used potential links above 0.9 surname J-W, and our original potential links file contains potential links down to the 0.8 surname level. After flagging linked households from the 0.9 level (both the rules-based linked households and the household links based on combo score), we set them aside and include all records from currently unlinked households and repeat the process. Table 7 gives the number of households linked at the end of the 0.8 surname level pass (8 categories): 2.4 million linked households consisting of over 9 million individual links.

Table 7 also gives the non-migration rates for the 8 categories of household links. However, since we used the presence of common neighbors to establish 6 of the 8 categories of linked households, the non-migration rate is not an indication of consistency (at least not as a comparison to the rules-based categories of household links, where we did not use the presence of common neighbors to establish the link). A comparison of the 1st category (rules-based household links using a 0.9 threshold for surname) to the 5th category (rules-based household links using a 0.8 threshold for surname) shows that the latter category does have a lower rate of non-migration, which could be indicative of a higher rate of false positives. Table 8 replicates the diagnostics shown earlier in Table 4 (which used the 0.9 surname threshold, rules-based household links). In general the 2nd batch of rules-based household links have lower rates of non-migration compared to the same categories in Table 4, but overall the range for the 0.8 threshold (rules-based) household links is similar to what we found for the 0.9 threshold (rules-based) household links.

The top panel in Table 9 shows the household linkage rate for all 1880 households by the number of 1880 linkable records. In contrast to Table 5, where we only included the first batch of rules-based household links (using the 0.9 surname threshold), this version includes all of our current household links. Our overall linkage rate is now over 24 percent, although the linkage rate remains quite a bit lower for the smaller households. The bottom panel of the table restricts the universe to 1880 households at risk of being linked and gives the household linkage rate by race and nativity. Since we eventually were willing to link 1880 households with two linkable records, the only 1880 households not in the linkable universe are the 1880 households that only contain one linkable record. The linkage rate for 1880 linkable households is 26.3 percent, which is lower than the comparable figure in Table 5 (which was 32.4 percent). But the linkable household universe here is inflated by the inclusion of 1880 households containing only two linkable records (which make up almost half of the 1880 households, but are only linked 7 percent of the time). And we suspect that many of the households containing only two linkable records did not exist in 1870 (i.e., younger married couples).

Table 9 gave the number of individual potential links contained in our current batch of linked households. However, this underestimates the number of true links in the linked households; similar to what we found in our St. Louis linked households, we have many currently unlinked records in our linked households that appear to be the true link. Figure 9 shows a few examples of linked households. In the first example we establish the linked household based on the household head and spouse in 1880 (W. N. and Sarah Ann) and one of their children (Ida). However, there are other children in the 1880 household who were also present in the household in 1870. But we were unable to establish these links at the individual level because of birthplace inconsistency (John and Walter had missing birthplace information in 1870, while Howard’s birthplace was recorded as Iowa in 1870 and Illinois in 1880) and low given name similarity (Cora vs. Carrie for the daughter). And we can assume that eight-year-old Willie in the 1880 household was not yet born in 1870.

The second example shows a household with four explicit links. The three unlinked members in the 1880 household also appear to be in the 1870 household but were unlinked because of excessive differences in expected age (the head was age 28 in 1870 and age 46 in 1880, while the spouse was age 25 in 1870 and age 43 ten years later) and given name (Ann E. vs. Analiscia). And we assume the two other members of the 1880 household were not present in 1870 (Minnie I. was 9 years old in 1880 and was probably unborn in 1870 and John Peterman was a 21-year-old unrelated individual in 1880).

The households in the third example contain five individuals explicitly linked. We were unable to link Elwood C. at the individual level because he/she was enumerated as a male in 1870 and as a female in 1880. Despite the name difference, we are fairly confident that 0-year-old Rosetta J. in 1870 is actually 10-year-old Josephine R. in 1880. It is also possible that 21-year-old Minerva in 1870 is 29-year-old Louiza J. in 1880. But in contrast to Rosetta J.-Josephine R., where transposing first and middle names results in similarity, there is no obvious commonness between the names Minerva and Louiza J.

These examples are not strictly representative, but demonstrate that many of our linked households in 1880 contain unlinked records that also have their true link in the 1870 household. In general, if we establish a linked household, then we expect unlinked records with a nuclear relationship (i.e., head, spouse or child) and age greater than or equal to 10 to also be present in the 1870 household. There are categories where this assumption is less likely to be true. For example, an older child in 1880 might have already left home at the time of the 1870 census despite being present for the 1880 enumeration. The youngest linkable children in 1880 (ten- or even eleven-year-olds, for example) might actually have not been born at the time of the 1870 census (and some of the nine- or even eight-year-olds in 1880 were actually alive in 1870). Spouses with low age or name similarity could be indicative of second marriages. Given these exceptions to our general assumptions about co-residential persistence, we initially adopted a fairly conservative approach to forcing linkages between records with low similarity for key linkage variables.

We will eventually develop a more nuanced approach to deal with this complex problem, but for this paper we adopted a simple procedure based on our household linking rules. First we drop all thresholds, and compare all unlinked household members from the 1870 household to all linkable members of the 1880 household (i.e., we block by household and exclude 1880 records younger than ten and those with a non-nuclear relationship to head). We award one point for each of the following: same sex, same birthplace, age within 4 years of expected age, and given name similarity greater than 0.9. Using an example from Figure 9, Elwood C in 1880 would get three points for the comparison to Elwood C in 1870 (one point each for given name, age, and birthplace, but not for sex). The maximum number of points for the forcing procedure is three points (because all of these records failed to link initially because of low similarity or mismatch in at least one of the key linkage variables). If a comparison gets three points, and no other comparison gets at least three points, then we force the link.
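The point system can be sketched as below. This is an illustration under several assumptions that are not from the text: the record fields and names are hypothetical, `difflib.SequenceMatcher` stands in for Jaro-Winkler to keep the example short, the expected 1880 age is the 1870 age plus ten, and "no other comparison" is read as no other three-point comparison involving either of the two records.

```python
from difflib import SequenceMatcher

NUCLEAR = {"head", "spouse", "child"}


def name_similarity(a, b):
    # Stand-in for Jaro-Winkler (an assumption made for brevity).
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def points(r70, r80):
    """One point each for sex, birthplace, age within 4 years of the
    expected age, and given name similarity above 0.9."""
    return (
        (r70["sex"] == r80["sex"])
        + (r70["bpl"] == r80["bpl"])
        + (abs(r80["age"] - (r70["age"] + 10)) <= 4)
        + (name_similarity(r70["given"], r80["given"]) > 0.9)
    )


def force_links(unlinked70, unlinked80, needed=3):
    """Return (id70, id80) pairs within one linked household whose score
    reaches `needed` and that have no rival reaching it as well."""
    eligible80 = [r for r in unlinked80 if r["age"] >= 10 and r["relate"] in NUCLEAR]
    strong = [
        (r70["id"], r80["id"])
        for r70 in unlinked70
        for r80 in eligible80
        if points(r70, r80) >= needed
    ]
    forced = []
    for a, b in strong:
        # A rival shares exactly one of the two records with this pair.
        rivals = [(x, y) for x, y in strong if (x == a) != (y == b)]
        if not rivals:
            forced.append((a, b))
    return forced
```

Under this reading, a sex conflict alone still yields three points and a forced link, whereas two same-age siblings with dissimilar names in both years produce rival three-point comparisons and stay unlinked.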

Figure 10 shows the forced linking procedure applied to the households from Figure 9. Despite failures to initially link at the individual level, all of the forced links look highly probable with the exception of Louiza J. to Minerva in the Miller household, but even here we would assume that there is a possibility that Louiza J. is actually Minerva. The forcing procedure establishes links for 1,183,892 records, or about 71 percent of the unlinked but linkable 1880 records. Some of the current forced links are errors, but we anticipate refining the approach to address the issue of false positives. But it also appears that many of the linkable but still unlinked 1880 records do have their true link residing in the 1870 household. In the first example in Figure 11 we have one unlinked record in the 1880 household, 21-year-old John W., who is probably 11-year-old Walker in the 1870 household. In addition to the low similarity between the given names, the two records have mismatched birthplaces. The second household in Figure 8 shows an extreme example of ambiguity in the forcing process. The 1870 household contains two 13-year-old males with given names of Abda F and Abba F. Despite the presence of two males in the 1880 household who were 23 years old, the forcing procedure cannot determine the correct link (i.e., because either could be Felix or Festus in the 1880 household).

A review of our forced links discloses that low given name similarity was the primary reason records were not linked as part of the initial household linking process. We anticipate improving our given name standardization process, which would increase the given name similarity for some of these records (and thus increase the probability that these records will be compared to their true link at the individual level). But as seen in previous examples, many true links with low given name similarity were enumerated with distinctly different given names in the two enumerations. We have 41,472 males with the given name of Henry in 1880 in the group of forced links. Approximately 45 percent also had a given name of Henry in 1870, with a much smaller percentage having names or variants that could be standardized as Henry (like Harry or Harvey). But most have given names that are definitely not Henry. For example, we have 1,714 Henry-William combinations and almost 40 percent of the Williams have a middle initial of ‘H’ in 1870. Many of the forced links that have low given name similarity also have a middle initial that increases confidence in the link, but a majority do not have middle name or initial information.

Although we have not finished developing a comprehensive approach to the household linking process, we have begun to assess the range of precision for our key linkage variables. Tables 10 and 11 give the range of imprecision for our current linked data, which includes both explicit and forced links. In general, precision levels are higher for our complete-count household links compared to the St. Louis household links (see Tables 2 and 3). However, the ability to make strict comparisons is limited by a number of factors. For example, approximately 11 percent of our complete-count household links have surname similarity below the 0.9 level. The comparable figure for the St. Louis household links was 28 percent. We expect the proportion of complete-count links where this is true to increase as we lower our surname similarity threshold in the potential link selection process; i.e., some of the currently unlinked households are unlinked precisely because all household members have low surname similarity to their true links.17 The relatively closed universe of the two enumerations of St. Louis, along with the availability of street and house number information, allowed us to link smaller households or households with low levels of similarity; in other words, we were able to get closer to the bottom of the barrel than we will ever be able to do with households enumerated 10 years apart.18

17 And this same logic would apply to the higher levels of imprecision for place of birth in the linked St. Louis data; we blocked by place of birth in constructing the complete-count individual level links; we suspect that some of the currently unlinked households are unlinked because most or all household members have mismatched birthplace information.

18 It is also possible that the first enumeration of St. Louis was an example of a shoddily taken census, while the second enumeration (which used a reference date five months prior to the date of the recount) introduced imprecision in recording information for individuals who had left the city. A more charitable interpretation would be that imprecision found in St. Louis in 1880 would be representative of enumerations in large American cities in the nineteenth century, and that we would expect greater precision for individuals enumerated in small towns and rural areas. Whether or not the imprecision in the linked St. Louis data is an outlier is an interesting issue, but we nonetheless also find relatively high imprecision in the complete-count linked data.

Going Forward

Our current linkage project will eventually include links covering the 1850, 1860, 1870, and 1880 complete-count census databases. Based on our initial results, we are fairly confident that we will link a sizable proportion of 1880 records to all three of the previous decennial censuses using the household linking approach (year of birth permitting). Going forward, some of our work will focus on better methods of identifying and eliminating false positives. The use of additional evidence derived from common neighbors and co-resident kin implies that we have a higher standard; our (unachievable) goal is to never make an incorrect household link. Quality control can be tedious (and demoralizing when it uncovers a logical flaw or two) but it is a necessary part of the process. And we will continue to evaluate quality issues as we proceed to create additional household links in the 1870-1880 data. Some households will never be linked, but we hope to ultimately double our current household linkage rate. Some of our optimism is based on our experience with St. Louis; although we already saw diminishing returns in our second pass using a lower surname threshold, we anticipate finding additional household links below a 0.8 surname similarity threshold. We also suspect a significant number of households remain unlinked because of birthplace inconsistency. Analysis of our forced links (often forced due to low given name similarity) will result in improved given name standardizations (or aliases). We will also refine our measurement of household uniqueness and neighbor calculations. The PHHN (i.e., common neighbors) approach needs some calibration, but promises to link many additional households.

Although we anticipate continuing to find households based on the process of elimination, some households will remain unlinked because they did not exist in the previous census. A common example would be older sons in the 1870 census who leave home and get married; thus they will be living with a spouse and children under the age of 10 in the 1880 census. However, if we can link their household of origin in 1870 to an 1880 household and verify that they were absent from that 1880 linked household, then we are more confident in creating a household link absent the presence of any corroborative kin. Figure 12 gives an example based on the grid example (i.e., Figure 8). Figure 12 gives the entire 1870 and 1880 households for the Mathis household, and we can see that the four oldest sons in the 1870 household were not present when the household was enumerated in 1880. Although this household was not the reference point for this specific grid, we can identify what appears to be two of the absent sons (with their wives and children) in this grid, and they are located in close proximity to the 1880 household that contains their parents and younger siblings. We do not know how many of these types of households we will be able to link, but we believe the use of common neighbor information greatly expands our ability to confidently verify linkage decisions.

The household links will be useful for some types of analysis (e.g., where the relevant unit of study consists of married couples or related groups) but they will definitely be biased. But we also anticipate continuing to construct individual (minimal bias) level links. Here the household links can be used in two primary ways. They can be used as a verification set for links established at the individual level. And the household links are an important part of this process because of the presence of the forced links (i.e., links not initially present in our potential links file, typically because of low similarity or mismatch in at least one linkage variable). For the most part, these records will rarely be linked by individual-level classifiers. An accurate estimation of the false positive rate requires establishing all true links (despite low similarity or mismatched linkage variables) and the only way to do this is to use a maximum amount of information (i.e., the household linkage process).
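The verification idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both link sets are taken to be dictionaries from an 1870 record ID to an 1880 record ID, and the function name, the IDs, and the three-way classification are hypothetical rather than the project's actual procedure.

```python
def check_against_household_links(individual_links, household_links):
    """Classify individual-level links using household links as the reference.

    Both arguments map an 1870 record ID to the 1880 record ID it is linked to
    (hypothetical layout, for illustration only).
    """
    claimed_1880 = set(household_links.values())
    verified, false_positive, unverified = {}, {}, {}
    for id70, id80 in individual_links.items():
        if id70 in household_links:
            # The household process linked this 1870 record: the targets must agree.
            bucket = verified if household_links[id70] == id80 else false_positive
        elif id80 in claimed_1880:
            # The 1880 record already belongs to a different household-level link.
            bucket = false_positive
        else:
            # Neither record appears in the household links: nothing to check against.
            bucket = unverified
        bucket[id70] = id80
    return verified, false_positive, unverified


# Made-up IDs; the household set would include explicit and forced links.
individual = {"a1": "b1", "a2": "b9", "a3": "b3", "a4": "b2"}
household = {"a1": "b1", "a2": "b2"}
verified, false_pos, unverified = check_against_household_links(individual, household)

# Share of checkable individual-level links that contradict the household links.
fp_rate = len(false_pos) / (len(verified) + len(false_pos))
```

Links that fall in the unverified group are exactly the ones the following paragraphs worry about, since the household links say nothing about them.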

One issue with the household links as a verification set is that they will not cover the entire population of individual-level links. This is true (i.e., some individual level links will not be verified because we did not link their household), but we suspect that our individual-level links will contain a disproportionately high number of links established at the household level. This is because the inability to be linked at the household level implies a number of conditions or characteristics at the individual level.

We would (theoretically) expect similar levels of linkage variable precision for some groups of individuals not linked at the household level (compared to those linked at the household level). This would include the sons who transition to marriage and the establishment of their own households between censuses. This would also include households with relatively common names (combined with large birthplace blocks) that remain unlinked because of ambiguity (especially if they lack common neighbors in the two censuses). These two groups should have overall precision comparable to the household linked set (although ambiguous records at the household level will also be ambiguous at the individual level).

But many members of the household linking resistance are harder cases. Under-enumeration in the 19th century was fairly high (possibly as high as five percent). We also have some 1880 households that were not in the country in 1870 (and from a record linkage perspective they are similar to under-enumerated records). We are still in the speculative stage, but in addition to households missing from one enumeration or the other, it seems plausible that at least that many are underwater (i.e., we will never be able to link them even at the household level because of low similarity or mismatch for one or more linkage variables). Less extreme, but still problematic, is the sizable 19th century unrelated population. Rarely will they be co-resident with the same people in both censuses. And the accuracy of their names, age and birthplace will undoubtedly vary, but we suspect that the quality of information for unrelated individuals enumerated in the 19th century is relatively poor.

So the part of the 1880 population that is not part of the household linked universe will consist of a higher proportion of records that either do not have a true link or have a relatively low similarity true link. It is possible that an individual-level classifier trained and tested on the household links (and calibrated to get an optimal combination of linkage and false positive rates) will not perform nearly as well on the set of records that were not linked at the household level (primarily because this universe contains many records without true links, and some of these records get linked randomly at lower levels of classifier-approved thresholds).

But maybe this really does not matter. It is possible that a well-designed individual level classifier achieves “acceptably” low false positive rates, in that the presence of some incorrectly linked records does not significantly affect research results. This has been the standard default position for previous linkage projects, but it has mostly been based on speculative optimism (i.e., faith-based record linkage). Ultimately we hope to produce a fairly comprehensive set of verified household links for 1850 through 1880. We will also produce linked data at the individual level. Thus we will have three different linked sets: household links; individual-level links; and individual-level links with the false positives removed (i.e., false positives identified by comparing the individual-level links to the household links). We plan on experimenting with different types of analysis (e.g., female labor force participation, socioeconomic mobility, etc.) to see if we get different results based on which linked set we use.


Figure 1. Primary and Secondary Links, 1870-1880 Male-Only Sample

linktype   fname70    lname70  age70  relate70  fname80    lname80  age80  relate80
unlinked   JOHN       MCHUGH   50     head
unlinked   REBECCA    MCHUGH   37     spouse
primary    HENRY      MCHUGH   14     child     HENRY      MCHUGH   25     child
unlinked   JAMESE     MCHUGH   3      child
unlinked   JANER      MCHUGH   0      child
unlinked                                        CATHARINE  MCHUGH   64     head
unlinked                                        ELLEN      MCHUGH   38     child
unlinked                                        EDWARD     MCHUGH   35     child
unlinked                                        MARYF.     MCHUGH   27     child
unlinked                                        MARYE.     MCHUGH   16     grandchild
unlinked                                        EDWARDJ.   MCHUGH   12     grandchild

linktype   fname70    lname70  age70  relate70  fname80    lname80  age80  relate80
primary    JAMES      FELKINS  61     head      JAMESH.    FELKIN   71     head
unlinked   MARTHA     FELKINS  53     spouse
secondary  NANCY      FELKINS  35     child     NANCY      FELKIN   42     child
unlinked   BUNELL     FELKINS  28     child
secondary  ELISABETH  FELKINS  16     child     ELISIBETH  FELKIN   23     child
secondary  PAIKNEY    FELKINS  14     child     PINKNY     FELKIN   22     child
unlinked                                        MATILDA    FELKIN   67     spouse

Notes: fname70 = first name in 1870; lname70 = last name in 1870; age70 = age in 1870; relate70 = imputed relationship to head in 1870; fname80 = first name in 1880; lname80 = last name in 1880; age80 = age in 1880; relate80 = imputed relationship to head in 1880.


Figure 2. 1850 Slave Schedule


Figure 3a. Potential matches for John O’Donnell, St. Louis 1880

fname1  lname1     age1  fname2  lname2     age2
JOHN    O'DONNELL  43    JOHN    O'DONNELL  45
                         JOHN    O'DONNELL  45
                         JOHN    O'DONNELL  46

Figure 3b. Households containing potential links for John O’Donnell, St. Louis 1880

fname1   lname1     age1  fname2     lname2     age2  p_link  sum_p_link
JOHN     O'DONNELL  43    JOHN       O'DONNELL  46    1       5
MARY     O'DONNELL  43    MARY       O'DONNELL  44    1       5
MICHAEL  O'DONNELL  15    MICHAEL    O'DONNELL  16    1       5
PATRICK  O'DONNELL  9     PATRICK    O'DONNELL  9     1       5
BRIDGET  O'DONNELL  6     BRIDGET    O'DONNELL  5     1       5

JOHN     O'DONNELL  43    JOHN       O'DONNELL  45    1       1
MARY     O'DONNELL  43    ELLEN      O'DONNELL  40    0       1
MICHAEL  O'DONNELL  15    JULIA      O'DONNELL  12    0       1
PATRICK  O'DONNELL  9                                 0       1
BRIDGET  O'DONNELL  6                                 0       1

JOHN     O'DONNELL  43    JOHN       O'DONNELL  45    1       1
MARY     O'DONNELL  43    MARGRET    O'DONNELL  39    0       1
MICHAEL  O'DONNELL  15    JOHN       O'DONNELL  19    0       1
PATRICK  O'DONNELL  9     ELIZEBETH  O'DONNELL  14    0       1
BRIDGET  O'DONNELL  6     FRANCIS    O'DONNELL  12    0       1
                          WILLIAM    O'DONNELL  4     0       1

Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration; p_link = indicates a potential link between individuals listed; sum_p_link = the sum of potential links between specific households.
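As a rough illustration of the sum_p_link column, the sketch below adds up the p_link flags for each pair of households and picks the pair with the largest total. The p_link values are the ones shown in Figure 3b; the household labels ("H1", "A", "B", "C") and the function name are hypothetical, and how p_link itself is assigned is taken as given.

```python
from collections import defaultdict


def sum_potential_links(rows):
    """Add up p_link flags for every (first household, second household) pair."""
    totals = defaultdict(int)
    for hh1, hh2, p_link in rows:
        totals[(hh1, hh2)] += p_link
    return dict(totals)


# One row per person comparison; flags follow the three blocks of Figure 3b.
rows = (
    [("H1", "A", 1)] * 5                          # all five members are potential links
    + [("H1", "B", 1)] + [("H1", "B", 0)] * 4     # only JOHN is a potential link
    + [("H1", "C", 1)] + [("H1", "C", 0)] * 5     # only JOHN is a potential link
)
totals = sum_potential_links(rows)
best_pair = max(totals, key=totals.get)
```

Here the first candidate household stands out with a total of 5, while the other two candidates share only the single potential link for JOHN.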


Figure 4. A linked household, St. Louis 1880

fname1     lname1    age1  fname2  lname2  age2  p_link  sum_p_link  J-W fname  J-W lname
AUTONIA    STROUBEL  52    ANTON   STRUBE  53    0       4           0.69       0.94
ELIZABETH  STROUBEL  42    ELIZA   STRUBE  42    1       4           0.91       0.94
ANNIE      STROUBEL  19    ANNIE   STRUBE  14    0       4           1.00       0.94
MINNIE     STROUBEL  12    MINNIE  STRUBE  13    1       4           1.00       0.94
LOUISA     STROUBEL  10    LOUISE  STRUBE  11    1       4           0.93       0.94
DORETTA    STROUBEL  4     DORA    STRUBE  5     1       4           0.90       0.94

Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration; p_link = indicates a potential link between individuals listed; sum_p_link = the sum of potential links between specific households; J-W fname = Jaro-Winkler similarity score for first name strings; J-W lname = Jaro-Winkler similarity score for last name strings.
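For readers unfamiliar with the Jaro-Winkler score reported in these figures, a minimal Python sketch of the textbook definition follows. It assumes the common defaults (prefix scale 0.1, prefix capped at four characters, no minimum Jaro score before the prefix bonus applies); the implementation actually used to produce the figures may differ in such details.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused, identical character within the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, scale=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix of up to max_prefix characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * scale * (1 - base)


surname_score = round(jaro_winkler("STROUBEL", "STRUBE"), 2)
```

Under these assumptions the sketch gives 0.94 for STROUBEL and STRUBE, which agrees with the surname score shown in Figure 4.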


Figure 5. Linked households, St. Louis 1880

fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
MATHEW     BURGHERDT   40    MATHEW     BURKHART  47    1.00       0.86
ELIZABETH  BURGHERDT   40    ELIZABETH  BURKHART  40    1.00       0.86
CATHERINE  BURGHERDT   12    KATE       BURKHART  11    0.69       0.86
ELIZABETH  BURGHERDT   9     ELIZABETH  BURKHART  9     1.00       0.86
WILLIAM    BURGHERDT   4     WILLIAM    BURKHART  4     1.00       0.86

fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
DAVID      FITZGERALD  48    DAVE       VETZGURA  45    0.85       0.67
MARY       FITZGERALD  34    MARY       VETZGURA  36    1.00       0.67
ANNIE      FITZGERALD  12    ANNA       VETZGURA  12    0.85       0.67
KATE       FITZGERALD  10    KATE       VETZGURA  11    1.00       0.67
ANDREW     FITZGERALD  5     ANDREW     VETZGURA  6     1.00       0.67
NORA       FITZGERALD  2     MONORA     VETZGURA  3     0.81       0.67
RICHARD    FITZGERALD  0     RICHARD    VETZGURA  0     1.00       0.67

fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
FRANK      KLAESER     60    F.H.       CLASSEN   60    0.76       0.63
BRIDGET    KLAESER     56    BRIDGET    CLASSEN   58    1.00       0.63

fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
CAROLINE   SCHWARTZ    60    CATHERINE  SCHMARG   60    0.76       0.85
AUGUSTA    SCHWARTZ    26    AUGUSTE    SCHMARG   25    0.94       0.85

Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration; fname J-W = Jaro-Winkler similarity score for first name strings; lname J-W = Jaro-Winkler similarity score for last name strings.


Figure 6. Selected surname combinations in the linked data, St. Louis 1880

lname1      lname2       J-W   NYSIIS  Doublemeta  match1  match2  match3
COBB        COBBS        0.96  1       1           1       1       1
MAIER       MIER         0.94  1       1           1       0       0
BLOCH       BLOCK        0.92  0       1           1       1       1
SCHLEGEL    SCHLAEGD     0.90  0       0           1       1       1
KAMPF       KEMPF        0.88  1       1           1       0       0
LAMPE       LAMPKING     0.86  0       0           1       1       1
NOOTEN      NEWTON       0.84  1       1           1       0       0
BORGERS     BORSGUS      0.82  0       0           1       1       1
GERRAN      GUERIN       0.80  1       1           1       0       0
THORNALLY   TOMALLI      0.78  0       0           1       0       0
BOETTE      BOOTH        0.76  0       0           1       1       0
BROCHRIGT   BROOKLINE    0.74  0       0           1       1       1
HEFFNER     HOFFMANN     0.72  0       0           1       0       0
RUBIN       LUBIER       0.70  0       0           0       0       0
GOTTMAYER   KOLMEYER     0.66  0       0           0       0       0
THOMA       TGNAZ        0.64  0       0           1       0       0
BOICE       NOYES        0.60  0       0           0       0       0
KOOKENBERG  GUEGGESBERY  0.55  0       0           0       0       0
KEEVIL      DRISCOLL     0.53  0       0           0       0       0

Notes: 0/1 indicates that the name combination would not match/match for phonetic/matching codes; J-W = Jaro-Winkler similarity score for last name combination; NYSIIS = whether the name combination has a NYSIIS match; Doublemeta = whether the name combination has a double metaphone match; Match1 = whether the name combination matches on first letter; Match2 = whether the name combination matches on first 2 letters; Match3 = whether the name combination matches on first 3 letters.
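The three first-letter blocking flags in Figure 6 are simple to reproduce. The short Python sketch below is illustrative only (the function name is hypothetical); it compares the leading one, two, and three characters of each surname pair.

```python
def prefix_flags(name1, name2, lengths=(1, 2, 3)):
    """Return 1/0 flags for agreement on the first 1, 2 and 3 letters."""
    return tuple(int(name1[:n] == name2[:n]) for n in lengths)


# Surname pairs taken from Figure 6.
flags_cobb = prefix_flags("COBB", "COBBS")
flags_maier = prefix_flags("MAIER", "MIER")
flags_boette = prefix_flags("BOETTE", "BOOTH")
```

MAIER and MIER agree only on the first letter even though their Jaro-Winkler score is 0.94, which shows how blocking on the first two or three letters of the surname would miss pairs that a string similarity measure retains.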


Figure 7. Examples of first name mismatches, St. Louis 1880

fname1    lname1       age1  fname2     lname2      age2
C.ALBERT  RAHNER       24    ALBERT     RAHNER      24
BERNARD   HILL         23    C.BERNARD  HILL        22
BRIDGET   CARTEN       33    M.BRIDGET  CARTEN      34
C.AMELIA  SHEERER      32    AMALIAC.   SHERER      35

fname1    lname1       age1  fname2     lname2      age2
BAYARDN.  ABBOTT       4     NELSON     ABBOTT      3
BELLE     HILTON       5     IDAB.      HILTON      5
DAVID     SUTTMUELLER  47    JOHND.     SULTMULLER  48
ELLEN     ROBINS       2     MARYE.     ROBBINS     3

fname1    lname1       age1  fname2     lname2      age2
THECKLA   NIEHAUS      57    MARY       NIEHAUS     57
TIMOTHY   LYNCH        17    BUD        LYNCH       18
LILLY     WALSER       0     GRACE      WALSER      0
WILLIAM   PERRIN       0     EUGENE     PERRIN      0

Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration.


Figure 8. Sample Neighbor Grid, Livingston County, Illinois, 1870-1880 Complete Count

rules only  rules plus neighbors  serial70 diff  serial80 diff  fname70     lname70     age70  fname80     lname80      age80  neighbor count (PHHN)  unique score  combo score
                                  -12            -67            MARY        WOODRUFF    34     MARY        WOODRUFF     43     10                     5             50
                                  -12            -67            ALPHONSO    WOODRUFF    15     ALPHONSO    WOODRUFF     25     10                     5             50
linked      ***                   -19            -66            JOHN        ARNOLD      52     JOHN        ARNOLD       63     10                     6             60
linked      ***                   -19            -66            LOUISA      ARNOLD      50     LOUISA      ARNOLD       61     10                     6             60
linked      ***                   -19            -66            WILLIAM     ARNOLD      26     WILLIAM     ARNOLD       35     10                     6             60
linked      ***                   -19            -66            FRANKLIN    ARNOLD      17     FRANKLIN    ARNOLD       27     10                     6             60
linked                            -13            -64            MARY        BUSSARD     50     MARY        BUZZARD      61     10                     12            120
linked                            -13            -64            OZILLA      BUSSARD     25     ROZILLA     BUZZARD      36     10                     12            120
linked                            -13            -64            WILLIAM     BUSSARD     19     WILLIAM     BUZZARD      28     10                     12            120
                                  -25            -17            GEORGE      CHRITTEN    16     GEO         CRITTEN      26     17                     3             51
                                  -25            -17            WILLIAM     CHRITTEN    43     WILLIAM     CRITTEN      55     17                     3             51
                                  0              0              SARAH       TURK        39     SARAH       TURK         48     13                     2             26
                                  0              0              EVALINE     TURK        4      EVALIENE    TURK         13     13                     2             26
                                  -10            2              JOSEPH      KIME        35     JOSEPH      KIME         48     17                     2             34
                                  -10            2              SUSAN       KIME        31     SUSAN       KIME         39     17                     2             34
linked      ***                   2              14             D           DEFENBAUGH  37     DAVID       DEFFENBAUGH  46     19                     36            684
linked      ***                   2              14             ISABELL     DEFENBAUGH  37     ISABELLA    DEFFENBAUGH  48     19                     36            684
linked      ***                   2              14             GEORGIANNA  DEFENBAUGH  9      GEORGANNA   DEFFENBAUGH  19     19                     36            684
linked      ***                   -7             28             SAMUEL      THOMSON     52     SML         THOMPSON     62     17                     4             68
linked      ***                   -7             28             HARIET      THOMSON     47     HARRIET     THOMPSON     58     17                     4             68
linked      ***                   -7             28             EDGAR       THOMSON     5      EDGAR       THOMPSON     15     17                     4             68
linked      ***                   -29            43             CALEB       MATHIS      46     CALEB       MATHIS       56     22                     18            396
linked      ***                   -29            43             SOFLENA     MATHIS      43     SOPLENA     MATHIS       53     22                     18            396
linked      ***                   -29            43             SOFLENA     MATHIS      9      SOPLENA     MATHIS       19     22                     18            396
linked      ***                   -29            43             WILLIAM     MATHIS      7      WILLIAM     MATHIS       16     22                     18            396
linked      ***                   -29            43             HELLAND     MATHIS      2      HOLLAND     MATHIS       12     22                     18            396
linked                            -47            44             SAERTIS     SMITH       51     LAERTES     SMITH        61     27                     8             216
linked                            -47            44             LOUISA      SMITH       48     LOUISA      SMITH        59     27                     8             216
linked                            -38            47             ANDREW      WRIGHT      55     ANDREW      WRIGHT       65     26                     14            364
linked                            -38            47             EMELINE     WRIGHT      44     EMMELINE    WRIGHT       54     26                     14            364
linked                            -41            53             WILLIAM     BOATMAN     52     WILLIAM     BOATMAN      62     26                     7             182
linked                            -41            53             ELENOR      BOATMAN     50     ELEANOR     BOATMAN      60     26                     7             182
linked                            -40            75             CHARLES     THRASHER    55     CHARLES     THRASHER     65     26                     40            1040
linked                            -40            75             MARY        THRASHER    43     MARY        THRASHER     52     26                     40            1040
linked                            -40            75             THANKFUL    THRASHER    6      THANKFUL    THRASHER     16     26                     40            1040
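In every row of Figure 8 the combo score equals the neighbor count (PHHN) multiplied by the unique score (for example 19 x 36 = 684). Assuming that product is how the score is built, which is an inference from the figure rather than a stated definition, a small Python sketch of scoring and ranking candidate household links might look like this; the function and variable names are hypothetical.

```python
def combo_score(neighbor_count, unique_score):
    """Combine neighbor support and household uniqueness into one score (assumed product)."""
    return neighbor_count * unique_score


# (neighbor count, unique score) pairs copied from Figure 8.
candidates = {
    "WOODRUFF": (10, 5),
    "TURK": (13, 2),
    "DEFENBAUGH": (19, 36),
    "THRASHER": (26, 40),
}
scores = {name: combo_score(*pair) for name, pair in candidates.items()}

# Highest combined score first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A multiplicative score of this kind rewards households that are both distinctive and surrounded by many common neighbors, and keeps low-uniqueness pairs such as TURK near the bottom even when they have several neighbors in common.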


Figure 9. Linked Household Examples, 1870-1880 Complete-Count

linked80  name1_70   name2_70  name1_80    name2_80  relate80   age70  age80  sex70   sex80   bpl70      bpl80
linked    WN         AYERS     W.N.        AYERS     head       45     54     male    male    Ohio       Ohio
linked    SARAH      AYERS     SARAHANN    AYERS     spouse     41     51     female  female  Vermont    Vermont
unlinked                       JOHN        AYERS     child             24             male               Washington
unlinked                       WALTER      AYERS     child             22             male               Washington
unlinked                       HOWARD      AYERS     child             19             male               Illinois
linked    IDA        AYERS     IDA         AYERS     child      6      16     female  female  Illinois   Illinois
unlinked                       CARRIE      AYERS     child             14             female             Iowa
unlinked                       WILLIE      AYERS     child             8              male               Arkansas
          JOHN       AYERS                                      14            male            missing
          WALTER     AYERS                                      12            male            missing
          HOWARD     AYERS                                      9             male            Iowa
          CORA       AYERS                                      4             female          Iowa

linked80  name1_70   name2_70  name1_80    name2_80  relate80   age70  age80  sex70   sex80   bpl70      bpl80
unlinked                       HENRYC.     CUTTING   head              46             male               Ohio
unlinked                       CORDELIA    CUTTING   spouse            43             female             Vermont
linked    LUCYA      CUTTING   LUCY        CUTTING   child      10     20     female  female  Ohio       Ohio
linked    WILLIAMH   CUTTING   WILLIAMK.   CUTTING   child      7      19     male    male    Ohio       Ohio
unlinked                       ANALISCIA   CUTTING   child             17             female             Ohio
linked    SAMUELJ    CUTTING   SAMUELJ.    CUTTING   child      4      14     male    male    Ohio       Ohio
linked    CORAA      CUTTING   CORAA.      CUTTING   child      1      11     female  female  Ohio       Ohio
unlinked                       MINNIEI.    CUTTING   child             9              female             Ohio
unlinked                       JOHN        PETERMAN  unrelated         21             male               Ohio
          HENRY      CUTTING                                    28            male            Ohio
          CORDELIA   CUTTING                                    25            female          Vermont
          ANNE       CUTTING                                    5             female          Ohio

linked80  name1_70   name2_70  name1_80    name2_80  relate80   age70  age80  sex70   sex80   bpl70      bpl80
linked    NATHAN     MILLER    NATHAN      MILLER    head       54     63     male    male    Ohio       Ohio
linked    MARGARETD  MILLER    MARGARETD.  MILLER    spouse     53     63     female  female  Ohio       Ohio
linked    CHARLESH   MILLER    CHARLEN.    MILLER    child      10     20     male    male    Ohio       Ohio
linked    SARAHJ     MILLER    SARAHM.     MILLER    child      13     23     female  female  Ohio       Ohio
unlinked                       ELWOODC.    MILLER    child             17             female             Ohio
unlinked                       LOUIZAJ.    MILLER    child             29             female             Ohio
linked    JOHNW      MILLER    JOHNW.      MILLER    child      2      12     male    male    Ohio       Ohio
unlinked                       JOSEPHINER. MILLER    child             10             female             Ohio
          ELWOODC    MILLER                                     7             male            Ohio
          MINERVA    MILLER                                     21            female          Ohio
          ROSETTAJ   MILLER                                     0             female          Ohio


Figure 10. Linked Household Examples After Forced Linking Process, 1870-1880 Complete-Count

linked80  name1_70   name2_70  name1_80    name2_80  relate80   age70  age80  sex70   sex80   bpl70      bpl80
explicit  WN         AYERS     W.N.        AYERS     head       45     54     male    male    Ohio       Ohio
explicit  SARAH      AYERS     SARAHANN    AYERS     spouse     41     51     female  female  Vermont    Vermont
explicit  IDA        AYERS     IDA         AYERS     child      6      16     female  female  Illinois   Illinois
forced    CORA       AYERS     CARRIE      AYERS     child      4      14     female  female  Iowa       Iowa
forced    JOHN       AYERS     JOHN        AYERS     child      14     24     male    male    missing    Washington
forced    WALTER     AYERS     WALTER      AYERS     child      12     22     male    male    missing    Washington
forced    HOWARD     AYERS     HOWARD      AYERS     child      9      19     male    male    Illinois   Washington
unlinked                       WILLIE      AYERS     child             8              male               Arkansas

linked80  name1_70   name2_70  name1_80    name2_80  relate80   age70  age80  sex70   sex80   bpl70      bpl80
forced    HENRY      CUTTING   HENRYC.     CUTTING   head       28     46     male    male    Ohio       Ohio
forced    CORDELIA   CUTTING   CORDELIA    CUTTING   spouse     25     43     female  female  Vermont    Vermont
explicit  LUCYA      CUTTING   LUCY        CUTTING   child      10     20     female  female  Ohio       Ohio
explicit  WILLIAMH   CUTTING   WILLIAMK.   CUTTING   child      7      19     male    male    Ohio       Ohio
forced    ANNE       CUTTING   ANALISCIA   CUTTING   child      5      17     female  female  Ohio       Ohio
explicit  SAMUELJ    CUTTING   SAMUELJ.    CUTTING   child      4      14     male    male    Ohio       Ohio
explicit  CORAA      CUTTING   CORAA.      CUTTING   child      1      11     female  female  Ohio       Ohio
unlinked                       MINNIEI.    CUTTING   child             9              female             Ohio
unlinked                       JOHN        PETERMAN  unrelated         21             male               Ohio

linked80  name1_70   name2_70  name1_80    name2_80  relate80   age70  age80  sex70   sex80   bpl70      bpl80
explicit  NATHAN     MILLER    NATHAN      MILLER    head       54     63     male    male    Ohio       Ohio
explicit  MARGARETD  MILLER    MARGARETD.  MILLER    spouse     53     63     female  female  Ohio       Ohio
explicit  SARAHJ     MILLER    SARAHM.     MILLER    child      13     23     female  female  Ohio       Ohio
explicit  CHARLESH   MILLER    CHARLEN.    MILLER    child      10     20     male    male    Ohio       Ohio
forced    ELWOODC    MILLER    ELWOODC.    MILLER    child      7      17     male    female  Ohio       Ohio
forced    MINERVA    MILLER    LOUIZAJ.    MILLER    child      21     29     female  female  Ohio       Ohio
explicit  JOHNW      MILLER    JOHNW.      MILLER    child      2      12     male    male    Ohio       Ohio
forced    ROSETTAJ   MILLER    JOSEPHINER. MILLER    child      0      10     female  female  Ohio       Ohio


Figure 11. Linked Household Examples, 1870-1880 Complete-Count

name1_70   name2_70  name1_80  name2_80  relate80  age70  age80  sex70   sex80   bpl70       bpl80
WILLIAM    FENTON    WM.H.     FENTON    head      35     46     male    male    New Jersey  New Jersey
CORDELIA   FENTON    CORDELIA  FENTON    spouse    33     44     female  female  DC          DC
                     JOHNW.    FENTON    child            21             male                DC
SAMUEL     FENTON    SAMUEL    FENTON    child     9      19     male    male    DC          DC
EMMA       FENTON    EMMA      FENTON    child     7      17     female  female  DC          DC
WILLIAM    FENTON    WILLIAM   FENTON    child     5      15     male    male    DC          DC
MARY       FENTON    MAY       FENTON    child     3      13     female  female  DC          DC
BESSIE     FENTON    BESSIE    FENTON    child     1      10     female  female  DC          DC
WALKER     FENTON                                  11            male            Virginia
IDA        WALKER                                  16            female          DC
LOUISA     BROWN                                   27            female          Maryland

name1_70   name2_70  name1_80  name2_80  relate80  age70  age80  sex70   sex80   bpl70       bpl80
WILLIAMJ   CANTRELL  W.J.      CANTRELL  head      56     67     male    male    Georgia     Georgia
AMANDA     CANTRELL  AMANDA    CANTRELL  spouse    43     54     female  female  Georgia     Georgia
                     FELIX     CANTRELL  child            23             male                Georgia
                     FESTUS    CANTRELL  child            23             male                Georgia
MARGARETA  CANTRELL  MAGGIE    CANTRELL  child     11     20     female  female  Georgia     Georgia
JOHN       CANTRELL  JOHN      CANTRELL  child     5      13     male    male    Georgia     Georgia
                     EVA       CANTRELL  child            8              female              Georgia
JAMESR     CANTRELL                                17            male            Georgia
MARGARETF  CANTRELL                                15            female          Georgia
ABDAF      CANTRELL                                13            male            Georgia
ABBAF      CANTRELL                                13            male            Georgia
SUSAN      CANTRELL                                38            female          Virginia
CHARLES    CANTRELL                                7             male            Georgia
ARMSTEAD   CANTRELL                                1             male            Georgia


Figure 12. Linking Older Sons, Livingston County, Illinois, 1870-1880 Complete Count

fname70   lname70  age70  fname80   lname80  age80  serial80  serial80diff
CALEB     MATHIS   46     CALEB     MATHIS   56     *799      0
SOPLENA   MATHIS   43     SOFLENA   MATHIS   53
SOPLENA   MATHIS   9      SOFLENA   MATHIS   19
WILLIAM   MATHIS   7      WILLIAM   MATHIS   16
HOLLAND   MATHIS   2      HELLAND   MATHIS   12
GEORGE    MATHIS   19
JAMES     MATHIS   17
ELBERT    MATHIS   13
EUGENE    MATHIS   12

fname80   lname80  age80  serial80  serial80diff
GEORGE    MATHIS   29     *805      6
SARAH     MATHIS   27
MAY       MATHIS   4
LENA      MATHIS   2
CARL      MATHIS   1

fname80   lname80  age80  serial80  serial80diff
JAMES     MATHIS   27     *819      20
ANNA      MATHIS   25
NELIE     MATHIS   2
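The search for absent sons near their parents' 1880 household can be sketched as follows. This is a hypothetical illustration, not the authors' procedure: it treats the visible digits of serial80 as the household serial number, and the age tolerance (2 years) and the maximum serial gap (25 households) are made-up parameters.

```python
def find_nearby_sons(absent_sons, heads_1880, parent_serial,
                     years=10, age_tol=2, max_serial_gap=25):
    """Match sons absent from the 1880 parental household to nearby 1880 heads.

    absent_sons: (first name, last name, age in 1870) tuples.
    heads_1880:  (first name, last name, age in 1880, household serial) tuples.
    """
    matches = []
    for fname, lname, age70 in absent_sons:
        for h_fname, h_lname, h_age, serial in heads_1880:
            same_name = (fname, lname) == (h_fname, h_lname)
            age_ok = abs(h_age - (age70 + years)) <= age_tol
            gap = serial - parent_serial
            if same_name and age_ok and abs(gap) <= max_serial_gap:
                matches.append((fname, lname, serial, gap))
    return matches


# Values from Figure 12: the four oldest sons in 1870 and two nearby 1880 heads.
absent = [("GEORGE", "MATHIS", 19), ("JAMES", "MATHIS", 17),
          ("ELBERT", "MATHIS", 13), ("EUGENE", "MATHIS", 12)]
heads = [("GEORGE", "MATHIS", 29, 805), ("JAMES", "MATHIS", 27, 819)]
found = find_nearby_sons(absent, heads, parent_serial=799)
```

With these inputs GEORGE and JAMES are found 6 and 20 households away from their parents, matching the serial80diff values in the figure, while ELBERT and EUGENE remain unaccounted for.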


Table 1. Linked households (top) and individuals (bottom), St. Louis 1880

A. By households (HH)

Number of      1st Enumeration                   2nd Enumeration
Related in HH  N HHs    N Linked HHs  Linked %   N HHs    N Linked HHs  Linked %
1              3,524    505           14.3       3,855    481           12.5
2              10,650   6,578         61.8       11,599   6,221         53.6
3              11,043   8,482         76.8       11,502   8,354         72.6
4              10,721   9,113         85.0       11,039   9,002         81.5
5              9,546    8,453         88.6       9,729    8,371         86.0
6              7,193    6,521         90.7       7,500    6,668         88.9
7              4,835    4,491         92.9       5,046    4,593         91.0
8              2,988    2,804         93.8       3,194    2,968         92.9
9              1,620    1,530         94.4       1,673    1,557         93.1
10+            1,205    1,155         95.9       1,361    1,274         93.6
All            63,325   49,632        78.4       66,498   49,489        74.4

B. By individuals

Number of      1st Enumeration                     2nd Enumeration
Related in HH  N Individuals  N Linked  Linked %   N Individuals  N Linked  Linked %
1              3,519          505       14.4       3,814          481       12.6
2              21,300         12,588    59.1       23,198         11,897    51.3
3              33,129         23,801    71.8       34,506         23,108    67.0
4              42,884         34,146    79.6       44,156         33,247    75.3
5              47,730         39,854    83.5       48,645         38,908    80.0
6              43,134         36,763    85.2       45,000         36,967    82.1
7              33,831         29,599    87.5       35,322         29,790    84.3
8              23,888         20,981    87.8       25,552         21,785    85.3
9              14,580         12,799    87.8       15,057         12,867    85.5
10+            12,688         11,147    87.9       14,487         11,896    82.1
All            276,683        222,183   80.3       289,737        220,946   76.3

Note: Related refers to household members related to the household head, either biologically or through marriage.


Table 2a. Linked population’s distribution by surname similarity measures, St. Louis 1880

                    N        Dist. (%)  NYSIIS  Double Meta  Match1  Match2  Match3
Less than 0.6       2,751    1.2        0.6     3.3          13.9    0.1     0.0
0.60 to 0.649       2,604    1.2        3.1     7.1          44.0    4.1     0.0
0.65 to 0.699       3,573    1.6        7.4     16.4         59.1    8.4     0.0
0.70 to 0.749       6,910    3.1        10.6    18.1         68.3    20.1    1.4
0.75 to 0.799       9,506    4.3        19.5    29.5         79.7    33.9    7.1
0.80 to 0.849       15,918   7.2        32.2    40.7         86.1    41.6    18.4
0.85 to 0.899       20,644   9.3        39.1    47.0         90.2    64.9    34.0
0.90 to 0.949       27,128   12.2       49.1    57.5         96.4    83.2    68.0
0.95 to 0.999       25,348   11.4       66.3    74.7         99.5    93.9    86.5
1.00 (Exact match)  108,048  48.6       100.0   100.0        100.0   100.0   100.0
All                 222,430  100.0      69.4    73.6         93.4    80.7    71.5

Table 2b. Distribution by Jaro-Winkler score for given names, St. Louis 1880

                    N        Dist. (%)  N Name Std.  % Name Std. (by row)
Less than 0.6       13,092   5.9        3,080        23.5
0.60 to 0.649       3,607    1.6        971          26.9
0.65 to 0.699       4,538    2.0        1,695        37.3
0.70 to 0.749       6,491    2.9        2,136        32.9
0.75 to 0.799       8,407    3.8        3,545        42.2
0.80 to 0.849       13,813   6.2        9,415        68.2
0.85 to 0.899       14,407   6.5        9,983        69.3
0.90 to 0.949       19,464   8.8        14,418       74.1
0.95 to 0.999       13,063   5.9        10,461       80.0
1.00 (Exact match)  119,595  53.8       0            0.0
Initial Match       5,953    2.7        0            0.0
All                 222,430  100.0      55,704       25.0


Table 3. Distribution of age, sex, race, birthplace precision, St. Louis 1880

A. Age difference
                        N        Dist. (%)
−2 (and greater) years  13,283   6.0
−1 year                 17,552   7.9
Same age                106,275  47.8
+1 year                 61,686   27.7
+2 (and greater) years  23,634   10.6
Total                   222,430  100.0

B. Sex
                        N        Dist. (%)
Agrees                  220,323  99.1
Disagrees               2,107    0.9
Total                   222,430  100.0

C. Race
                        N        Dist. (%)
Agrees                  221,904  99.8
Disagrees               526      0.2
Total                   222,430  100.0

D. Own birthplace
                        N        Dist. (%)
Agrees                  203,785  91.6
Disagrees               18,645   8.4
Total                   222,430  100.0

E. Father’s birthplace
                        N        Dist. (%)
Agrees                  182,620  82.1
Disagrees               39,810   17.9
Total                   222,430  100.0

F. Mother’s birthplace
                        N        Dist. (%)
Agrees                  180,917  81.3
Disagrees               41,513   18.7
Total                   222,430  100.0


Table 4. Migration Status for Rules-Based Household Links, 1870-1880 Complete-Count

Surname Similarity              N Linked HHs  Non Migrant (1)  Same State Different County (2)  Different State (3)  Migrant (2+3)
.90 to .909                     44,568        75.8             14.3                             9.9                  24.2
.91 to .919                     25,779        75.2             14.7                             10.1                 24.8
.92 to .929                     38,949        76.1             14.1                             9.9                  23.9
.93 to .939                     44,984        76.7             13.8                             9.5                  23.3
.94 to .949                     40,668        77.0             13.5                             9.5                  23.0
.95 to .959                     38,060        77.1             13.3                             9.6                  22.9
.96 to .969                     69,252        76.9             13.5                             9.6                  23.1
.97 to .979                     70,961        77.5             13.0                             9.5                  22.5
.98 to .999                     7,016         77.6             13.2                             9.3                  22.4
exact match                     1,173,183     79.0             12.0                             9.0                  21.0
All                             1,553,420     78.5             12.4                             9.1                  21.5

Household Uniqueness Score      N Linked HHs  Non Migrant (1)  Same State Different County (2)  Different State (3)  Migrant (2+3)
<10                             821,342       77.5             12.9                             9.5                  22.5
10-19                           340,343       78.8             12.2                             9.0                  21.2
20-29                           170,603       79.5             11.7                             8.8                  20.5
30-39                           100,776       79.9             11.6                             8.5                  20.1
40-49                           57,573        80.7             11.0                             8.3                  19.3
50-59                           31,557        81.2             11.0                             7.8                  18.8
60+                             31,226        81.7             10.5                             7.8                  18.3
All                             1,553,420     78.5             12.4                             9.1                  21.5

N potential Links in Household  N Linked HHs  Non Migrant (1)  Same State Different County (2)  Different State (3)  Migrant (2+3)
3                               561,326       76.8             13.0                             10.2                 23.2
4                               541,916       78.3             12.6                             9.1                  21.7
5                               272,968       80.0             11.7                             8.3                  20.0
6+                              177,210       81.9             10.8                             7.3                  18.1
All                             1,553,420     78.5             12.4                             9.1                  21.5


Table 5a. Household Linkage Rate, 1870-1880 Complete-Count (all 1880 households)

Number rules based linked households: 1,553,420
Number of explicitly linked individuals: 6,473,809

N Linkable in 1880 Household  N 1880 Households  N 1880 Households Linked  % Linked
1                             934,251            0                         0.0
2                             4,354,712          0                         0.0
3                             1,788,843          375,655                   20.9
4                             1,280,185          428,408                   33.4
5                             832,111            354,198                   42.5
6+                            889,856            395,159                   44.3
All                           10,079,958         1,553,420                 15.4

Table 5b. Household Linkage Rate, 1870-1880 Complete-Count (1880 households with 3 or more linkable records only)

Race and Nativity (Household Head)  N 1880 Households  N 1880 Households Linked  % Linked
Native-born white                   2,918,696          1,133,828                 38.7
Foreign-born white                  1,339,201          337,725                   25.2
Black                               456,746            67,722                    14.8
Mulatto                             71,053             13,872                    19.5
Other                               5,299              273                       5.2
All                                 4,790,995          1,553,420                 32.4


Table 6. Distribution of Neighbor Count (PHHN) For All Potential Household Links (0.9 Surname Threshold) and For Rules-Based Household Links, 1870-1880 Complete-Count

                            All Potential HH Links (0.9 Surname Threshold)  Rules-Based Household Links Only
N Neighbors in Grid (PHHN)  N Potential HH Links  Distribution              N Linked Households (Rules-based)  Distribution  % Non Migrant (linked households)
1                           12,727,140            59.3                      317,330                            20.4          28.5
2                           4,155,353             19.4                      124,729                            8.0           56.6
3                           1,459,194             6.8                       67,350                             4.3           71.8
4                           546,663               2.5                       43,488                             2.8           80.6
5                           238,992               1.1                       32,522                             2.1           86.6
6                           129,656               0.6                       27,108                             1.7           90.6
7                           88,112                0.4                       24,699                             1.6           92.9
8                           71,170                0.3                       23,441                             1.5           94.4
9                           65,238                0.3                       23,917                             1.5           95.2
10                          62,179                0.3                       24,166                             1.6           96.0
11                          61,185                0.3                       24,453                             1.6           96.4
12                          61,229                0.3                       25,431                             1.6           96.5
13                          61,355                0.3                       25,556                             1.6           96.5
14                          61,341                0.3                       25,788                             1.7           96.9
15                          61,751                0.3                       25,992                             1.7           97.0
16                          62,226                0.3                       26,479                             1.7           97.2
17                          62,844                0.3                       26,998                             1.7           97.5
18                          62,307                0.3                       26,757                             1.7           97.7
19                          62,998                0.3                       27,326                             1.8           97.5
20+                         1,345,209             6.3                       609,890                            39.3          98.8
All                         21,446,142            100.0                     1,553,420                          100.0         78.5


Table 7. Number of Linked Households and Individuals, Rules-Only and Rules Plus PHHN Grids

Link Type                 N Potential Links in Household  N Households Linked  N Linked Individuals  % Non Migrant (By Households)
rules only (0.9 surname)  3+                              1,553,420            6,473,809             78.5
rules plus (0.9 surname)  2                               485,800              982,388               97.5
rules plus (0.9 surname)  3                               87,326               266,460               97.6
rules plus (0.9 surname)  4+                              36,400               211,712               96.5
rules only (0.8 surname)  3+                              144,469              879,008               76.9
rules plus (0.8 surname)  2                               20,418               62,193                96.9
rules plus (0.8 surname)  3                               65,009               130,914               97.4
rules plus (0.8 surname)  4+                              8,902                50,911                94.9
All                                                       2,401,744            9,057,395


Table 8. Migration Status for Rules-Based Household Links, 1870-1880 Complete-Count (Surname 0.8 only)

Surname Similarity              N Linked HHs  Non Migrant (1)  Same State Different County (2)  Different State (3)  Migrant (2+3)
.80 to .809                     10,870        75.6             14.7                             9.7                  24.4
.81 to .819                     6,908         76.9             13.7                             9.5                  23.1
.82 to .829                     16,513        75.9             14.1                             10.0                 24.1
.83 to .839                     9,119         76.9             13.6                             9.5                  23.1
.84 to .849                     14,697        76.2             14.1                             9.7                  23.8
.85 to .859                     15,230        76.7             13.7                             9.6                  23.3
.86 to .869                     20,393        77.9             12.7                             9.3                  22.1
.87 to .879                     10,765        76.8             13.5                             9.7                  23.2
.88 to .889                     19,506        77.5             13.2                             9.3                  22.5
.89 to .899                     20,468        77.8             12.8                             9.5                  22.1
All                             144,469       76.9             13.5                             9.6                  23.1

Household Uniqueness Score      N Linked HHs  Non Migrant (1)  Same State Different County (2)  Different State (3)  Migrant (2+3)
<10                             65,579        76.4             14.1                             9.5                  23.6
10-19                           28,190        77.5             13.1                             9.4                  22.5
20-29                           21,886        77.2             12.9                             10.0                 22.8
30-39                           12,690        76.8             13.5                             9.7                  23.2
40-49                           7,500         77.5             12.5                             10.1                 22.5
50-59                           4,126         78.3             12.4                             9.3                  21.7
60+                             4,498         81.7             10.5                             7.8                  22.1
All                             144,469       76.9             13.5                             9.6                  23.1

N potential Links in Household  N Linked HHs  Non Migrant (1)  Same State Different County (2)  Different State (3)  Migrant (2+3)
3                               20,871        76.1             13.3                             10.6                 23.9
4                               71,348        75.5             14.5                             10.0                 24.5
5                               32,911        78.3             12.6                             9.1                  21.7
6+                              19,339        80.1             11.8                             7.3                  19.1
All                             144,469       76.9             13.5                             9.6                  23.1


Table 9a. Household Linkage Rate, 1870-1880 Complete-Count, Rules and Rules Plus PHHN Grids

Number rules based linked households: 2,401,744
Number of explicitly linked individuals: 9,057,395

N Linkable in 1880 Household  N 1880 Households  N 1880 Households Linked  % Linked
1                             934,251            0                         0.0
2                             4,354,712          305,461                   7.2
3                             1,788,843          585,676                   33.6
4                             1,280,185          572,697                   45.8
5                             832,111            442,524                   54.5
6+                            889,856            495,387                   57.0
All                           10,079,958         2,401,744                 23.9

Table 9b. Household Linkage Rate, 1870-1880 Complete-Count, Rules and Rules Plus PHHN Grids (1880 households with 2 or more linkable records only)

Race and Nativity (Household Head)  N 1880 Households  N 1880 Households Linked  % Linked
Native-born white                   5,729,709          1,743,796                 30.8
Foreign-born white                  2,307,905          510,277                   22.3
Black                               943,820            123,500                   13.2
Mulatto                             151,489            23,622                    15.8
Other                               12,784             549                       4.3
All                                 9,145,707          2,401,744                 26.3


Table 10a. Linked population’s distribution by surname similarity measures

                    N           Dist. (%)  NYSIIS  Double Meta  Match1  Match2  Match3
0.80 to 0.849       467,651     4.6        29.7    35.9         82.4    46.8    21.3
0.85 to 0.899       648,388     6.4        43.8    50.1         90.6    65.7    32.7
0.90 to 0.949       1,127,925   11.1       54.4    61.7         96.8    84.0    68.9
0.95 to 0.999       1,076,914   10.6       71.6    74.0         99.7    94.8    86.7
1.00 (Exact match)  6,920,409   67.4       100.0   100.0        100.0   100.0   100.0
Total               10,241,287  100.0      85.1    86.9         98.2    93.1    87.3

Table 10b. Distribution by Jaro-Winkler score for given names

                    N           Dist. (%)  N Name Std.  % Name Std. (by row)
Less than 0.6       554,168     5.4        151,288      27.3
0.60 to 0.649       64,808      0.6        6,092        9.4
0.65 to 0.699       115,222     1.1        37,332       32.4
0.70 to 0.749       162,728     1.6        25,386       15.6
0.75 to 0.799       281,654     2.8        92,664       32.9
0.80 to 0.849       274,735     2.7        81,871       29.8
0.85 to 0.899       419,735     4.1        218,682      52.1
0.90 to 0.949       783,995     7.7        61,936       7.9
0.95 to 0.999       500,938     4.9        40,576       8.1
1.00 (Exact match)  7,083,299   69.0       0            0.0
Total               10,241,287  100.0      715,827      7.0


Table 11. Distribution of age, birthplace, sex and race precision

A. Age difference
                        N           Dist. (%)
−5 (and greater) years  191,952     1.9
−4 years                134,355     1.3
−3 years                240,630     2.3
−2 years                564,267     5.5
−1 year                 1,957,802   19.1
Same age                4,940,063   48.2
+1 year                 1,337,226   13.1
+2 years                402,158     3.9
+3 years                178,233     1.7
+4 years                105,844     1.0
+5 (and greater) years  188,757     1.8
Total                   10,241,287  100.0

B. Birthplace agreement
                        N           Dist. (%)
Agrees                  9,983,344   97.5
Disagrees               257,943     2.5
Total                   10,241,287  100.0

C. Sex agreement
                        N           Dist. (%)
Agrees                  10,200,689  99.6
Disagrees               40,598      0.4
Total                   10,241,287  100.0

D. Race agreement
                        N           Dist. (%)
Agrees                  10,179,031  99.4
Disagrees               62,256      0.6
Total                   10,241,287  100.0