†Correspondence should be directed to: Diana Magnuson University of Minnesota, 50 Willey Hall, 225 19th Ave S., Minneapolis, MN 55455 e-mail: [email protected], phone: 612-624-5818, fax:612-626-8375
"Evaluating the Accuracy of Linked U. S. Census Data: A
Household Linking Approach"
The Systematic Linking of Historical Records, University of Guelph,
May 10-13, 2017
Ronald Goeken University of Minnesota
Yu Na Lee
University of Minnesota
Tom Lynch University of Minnesota
Diana Magnuson† Bethel University
December 2017
Working Paper No. 2017-1
Introduction
Despite the proliferation of published studies using linked decennial census records, there has been little empirical work on the accuracy of the linked data. The primary reason, of course, is that you can never definitively state that two records taken from two distinct censuses represent the same person. Given the absence of unique identifiers (e.g., social security numbers), matching historical census records depends on high similarity between primary linkage variables; e.g., names, age, sex, and place of birth. Potential links are then classified as true or false according to rules or machine learning procedures. Estimating linkage rates is a straightforward exercise, but error rates can only be measured indirectly.
The goal of most historical census linkage projects is to create linked data without relying on corroborative evidence derived from co-resident kin and migration status, because of bias issues. This is a valid concern, but it is also possible that relying on linkage methods that ignore a fair amount of corroborative evidence comes at a cost. The obvious effect would be to lower linkage rates. A potentially more significant concern would be the effect on error rates. The main issue is that if the true link is unidentifiable (because of under-enumeration or a mismatch or low similarity on key linkage variables), then any link to this record will be false.
Most record linkage projects more or less assume that the inability to find true links due to mismatches or low similarity for key linkage variables is a relatively minor issue. Our strategy for investigating this topic is to use a maximum amount of information to establish a set of verified links. Primarily, we plan on using the presence of common kin and residential stability (i.e., living in the same place) in successive decennial censuses to supplement similarity at the individual level. Although many true links will not have corroborative household or residential information, we find that many can be verified. These verified links will then be used to optimize blocking strategies and to test procedures used to classify potential links generated by individual level classifiers, primarily by constructing linkage and error rates.
This is still our basic mission statement. However, the nineteenth century linking--which is part of a five-year project examining demographic change in the aftermath of the American Civil War--is still in progress. We provide a status report in the last half of the paper, but in the first half we discuss the development of the household linking process.1
The 1880 Complete-Count Linkage Project (2003-2009)
In 2003 the Minnesota Population Center began work on a project that would eventually link the complete-count database of the 1880 U.S. population census to samples of other 19th and early 20th century U.S. decennial censuses. The original grant asserted that we would establish links at the individual level and only use a set of variables that would minimize linking bias; i.e., names, age, sex, race, and place of birth. We did not use place of residence or information gleaned from co-resident kin because of bias concerns (i.e., that non-migrants and those living with the same kin in both censuses would be overrepresented in the linked population).2
The decision to ignore corroborative evidence (because of bias concerns) ultimately resulted in the choice of a conservative linking strategy. The final linkage rates were modest, but we felt this was necessary in order to achieve (relatively) low false positive rates. Although we did not possess a “truth” sample for verification, indirect evidence indicated we had relatively low false positive rates. For example, if we independently linked two brothers who were co-resident in the 1880 census, only rarely were they not also co-resident in 1870 (i.e., sets of brothers came from the same households in both census years). Another example would be consistency in our male-only and couple-only linked samples; if a male from the 1880 census was linked in both of these samples, we rarely had this individual linked to two different records in the 1870 census.3 Both of these diagnostics offer evidence of consistency and indirectly imply precision. They also cherry-pick a bit, in that the selected universe was native-born whites in 1880; it is likely that error rates were higher for African Americans and the foreign-born (specifically the Irish). It is also probable that some demographic sub-groups might be more likely to have consistent information in successive censuses and thus be more likely to be accurately linked (and this would probably apply to married men and children). We were also more confident in our 1870-1880 linked sample compared to linked samples with inter-censal gaps exceeding ten years (i.e., we expect false positive rates to increase as the years between linked censuses increase).
1 Hacker, J. David. Principal Investigator. "Models of Demographic and Health Changes Following Military Conflict" 1R01HD082120-01. National Institute of Child Health/Human Development.
2 Ruggles, Steven. Principal Investigator. "Population Database for the United States in 1880." R01HD39327, NICHD-DBSB.
3 Ronald Goeken, Lap Huynh, T. A. Lynch and Rebecca Vick, “New Methods of Census Record Linking,” Historical Methods: A Journal of Quantitative and Interdisciplinary History, volume 44, issue 1, 2011. Steven Ruggles, “Linking Historical Censuses: A New Approach,” History and Computing, volume 14, March 2002, pp. 213-244.
Another reason we thought we had relatively low false positive rates (at least for married men and sons) was because we spent some time visually evaluating the linked households. Although we linked on the individual basis (primary links), the resulting linked data consists of the primary links along with their co-resident household members from the two specific censuses. If any of the non-primary records appeared to be the same person in the respective censuses, then we established the link based on a set of rules (with these linked records identified as secondary links).4
Many of the 1870-1880 primary links do not have co-resident secondary links for obvious reasons; an example would be a 24-year-old son living with his parents in 1870 linked to a 34-year-old household head living with his wife and three-year-old daughter in 1880. But many of the primary links have co-resident secondary links; in the male 1870-1880 linked sample 28 percent of the primary links have no secondary links, 19 percent have one and 53 percent have two or more secondary links. Although we did not do a systematic analysis, it is possible to pick out low-quality primary links, and the top panel in Figure 1 gives an example. Here the primary link is Henry McHugh, age 14 in 1870 and age 25 in 1880. However, no other record in either household appears to be the same person (with the only real possibility being James E, age 3 in 1870 and Edward J., age 12 in 1880). But an example like this is relatively rare in the 1870-1880 male linked sample. Much more common would be the linked records in the second panel. Here the primary link is James Felkins, age 61 in 1870 (linked to James H. Felkin, age 71 in 1880). And three of James’ kin are secondary links and they appear to be correctly linked despite the differences in expected age. In fact, there is also a high probability that Martha, age 53 in 1870, is the correct link to Matilda, age 67 in 1880.
4 See https://usa.ipums.org/usa/linked_data_samples.shtml.
Whether Martha is actually Matilda illustrates a basic dilemma with establishing some of the secondary links; they could be the same person, but maybe, or probably, not (it is definitely possible that James re-married to Matilda at some point between 1870 and 1880). But despite a somewhat conservative standard for establishing secondary links in cases of ambiguity, the secondary links had lower levels of similarity compared to our primary links. For example, in the 1870-1880 male file, less than 1 percent of the primary links have an expected age difference exceeding one year of age. For secondary links, over 20 percent have an expected age difference of two years of age or more.
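The "expected age difference" used in this section can be read as the gap between the age enumerated in the later census and the age implied by the earlier census plus the inter-censal interval. The paper does not give a formula, so the following Python sketch rests on that reading; the function name and the ten-year default are our assumptions. The ages are the ones quoted in the Figure 1 discussion.

```python
def expected_age_difference(age_earlier: int, age_later: int, gap_years: int = 10) -> int:
    """Absolute gap between the later enumerated age and the age
    implied by the earlier census (earlier age plus the census gap)."""
    return abs(age_later - (age_earlier + gap_years))


# (age in 1870, age in 1880) as quoted in the text.
examples = {
    "Henry McHugh": (14, 25),
    "James Felkins / James H. Felkin": (61, 71),
    "Martha / Matilda": (53, 67),
}

for name, (age_1870, age_1880) in examples.items():
    print(name, expected_age_difference(age_1870, age_1880))
```

Under this reading the Martha/Matilda pair is four years off, which is why it sits in the ambiguous range described above.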
The higher precision for our primary links resulted from our conservative linkage strategy. To use a simplified example, if we had two potential links for a given record, with one potential link being an exact match on all linkage variables and the other being an exact match on all variables with the exception of an expected age difference of four years, we would reject both potential links because of ambiguity (yes, the potential link with an exact age match would have a higher probability of being the true link, but we took a conservative linking approach). In addition, if our only potential link was an exact match except for an expected age difference of four years, we would reject because of low precision. In other words, we had a two-threshold approach, with the higher threshold determining eligibility to be a primary link, and the lower threshold identifying the area of ambiguity; a link was defined as one-and-only-one potential link above the higher threshold, and no other potential links above the lower threshold. This resulted in fairly accurate results, but also meant that our primary links were not representative of all true links. This finding, along with the understanding that many primary links can be verified through the presence of consistent co-resident kin in both census years, were important insights, but we really did not appreciate this until we were finished with the 1880 complete-count linkage project.
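The two-threshold rule can be sketched in a few lines of Python. This is only an illustration of the decision logic as described; the similarity scores, the threshold values, and the function name are hypothetical, and the project's actual classifier is not reproduced here.

```python
from typing import Optional, Sequence


def classify_two_threshold(
    scores: Sequence[float], upper: float, lower: float
) -> Optional[int]:
    """Return the index of the accepted candidate, or None if the record
    is left unlinked. Assumes upper >= lower."""
    above_upper = [i for i, s in enumerate(scores) if s >= upper]
    above_lower = [i for i, s in enumerate(scores) if s >= lower]
    # Accept only when exactly one candidate clears the upper threshold
    # and it is also the only candidate clearing the lower threshold.
    if len(above_upper) == 1 and len(above_lower) == 1:
        return above_upper[0]
    return None


# One strong candidate plus a second in the ambiguous zone: rejected.
print(classify_two_threshold([0.99, 0.90], upper=0.95, lower=0.85))
# One strong candidate and a clearly weak one: accepted (index 0).
print(classify_two_threshold([0.99, 0.60], upper=0.95, lower=0.85))
# A single candidate below the upper threshold: rejected.
print(classify_two_threshold([0.90], upper=0.95, lower=0.85))
```

The design trades linkage rate for precision: any competitor in the ambiguous zone blocks the link, even when one candidate is clearly the better match.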
Linking Slave-Owners to the 1850 Complete-Count Population Database
Our next linkage project was the 1850 complete-count database of the 1850 U.S. Census, which was a collaboration with the Church of Jesus Christ of Latter-Day Saints (LDS).5 In addition to the population records, LDS had also entered the 1850 slave schedules. The slave census has the slave owner names and we wanted to link the slave owners (and their slaves) to the slave owner’s population record. The population and slave enumerations were done simultaneously, so slave owners in the slave schedules and the population schedules should be (roughly) in the same order in their respective databases. However, it became apparent that some slave schedule pages were microfilmed out of their original order (and there are no page numbers or enumerator sequence numbers to verify the sort; the pages have an enumeration date field, but this information was often missing and was not considered to be incredibly reliable). The forms do not have information for slave owner age, birthplace or sex (and about 20 percent of slave owners in 1850 were female). The only owner-related information on the slave schedules is slave owner name, but the forms have legibility (and transcription) issues and given name often consists of a single initial.
Here we were not concerned about bias in linkage methods; the goal was to accurately link all of the slave owners to their respective population records. The basic rule was that slave owners and their population record would usually follow approximately the same sequence in both schedules (with some exceptions due to absentee slave owners). But we had to identify mini-sequences within counties when the slave schedule pages were out of order. To do this we blocked by county of residence and restricted potential links to records age 17 and older in the population data, and wrote out potential links that exceeded a preset threshold for given and surname similarity.
5 Alexander, Joseph Trent. Principal Investigator. "Baseline Microdata for Analysis of U.S. Demographic Change." PRF601864. National Institute of Child Health/Human Development.
The creation of slave owner sequences relied on identifying clusters of potential links (i.e., a high proportion of slave owners from a slave page or range of slave pages that had potential links to a given page or range of pages in the population data). Figure 2 shows an 1850 slave schedule page; a slave page has 84 lines for individual slaves, and this page has 14 slave holdings (i.e., 14 slave owners). The population schedules have 42 lines per page in 1850 and Washington County, Missouri had a free population of 7,736 in 1850; thus Washington County, Missouri had approximately 190 pages of population data in 1850. Again, the only information used to establish the link is name (i.e., we do not have age, birthplace or sex for the slave owners). The basic concept was that the 14 slave owners would have random potential links dispersed over the entire county (pretty much anywhere on pages 1 through 190 in the population data for this example). But typically we could identify a cluster of potential links on a given page or range of pages in the population data; probably not all 14, but we would see clusters, which would indicate that these potential links were the true link (even if another potential link elsewhere in the county had greater name similarity; in other words, sequence order was often a better predictor of the true link than name similarity).
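One way to picture the cluster search is as a sliding window over population pages that counts how many distinct slave owners from one slave schedule page have a potential link inside the window. The paper does not describe the procedure at this level of detail, so the Python sketch below is a guess at the idea rather than the authors' method; the window size, the input layout, and the function name are all assumptions.

```python
from collections import defaultdict


def best_page_window(potential_links, window=3):
    """potential_links: iterable of (owner_id, population_page) pairs for
    the owners on one slave schedule page.
    Returns (start_page, owners) for the run of `window` consecutive
    population pages covering the most distinct owners."""
    owners_by_page = defaultdict(set)
    for owner, page in potential_links:
        owners_by_page[page].add(owner)

    best_start, best_owners = None, set()
    for start in sorted(owners_by_page):
        covered = set()
        for page in range(start, start + window):
            covered |= owners_by_page.get(page, set())
        if len(covered) > len(best_owners):
            best_start, best_owners = start, covered
    return best_start, best_owners


# Hypothetical owners A-E with name-based candidates scattered over a county.
links = [("A", 12), ("A", 150), ("B", 13), ("C", 12),
         ("C", 77), ("D", 14), ("E", 160)]
start, owners = best_page_window(links, window=3)
print(start, sorted(owners))
```

Candidates that fall inside the winning window would then be preferred over better-scoring names elsewhere in the county, mirroring the observation that sequence order often beat name similarity.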
We eventually began to understand that we could apply the slave owner sequencing logic to linking the population records taken from two distinct censuses on the household basis. Basically, a household is a subset of a page of population data. And household members are similar to a group of slave owners on a given page of slave data. The analogy breaks down a bit when dealing with individuals enumerated ten years apart. However, as mentioned above, under certain circumstances we would expect some co-residential stability. Basically, if we find certain combinations of nuclear kin age ten and older co-residing in a given census year, there is a very high probability they were also co-residing ten years earlier. For example, the expectation is that a household head, spouse and two teen-aged sons in the 1880 census will also have been enumerated together in the same household in the 1870 census. At the individual level each of the four records could have multiple potential links to the 1870 census, but the true link would be identifiable because it would be the household combination that also had potential links for other members of the household. Again, the correct household might not have potential links to all four, but three out of four probably would be enough to establish and confirm the link.
Household Linking the Two Enumerations of St. Louis in 1880
One issue with this approach is the large number of individual potential links that need to be generated in order to establish the household links. Most of our potential links will be the only link between specific households in two different censuses (and will not be a true link), but we have no way of knowing this until we generate and process all of the potential links. And working with the complete-count tabulations would require improvements in our processing speed.
We also had to develop an actual process, which evolved during work we did linking the two enumerations of St. Louis in 1880. The first enumeration occurred in June and, because of allegations of an undercount, the Census Office authorized a second enumeration in November of the same year. This was not the first time that an American city would be re-enumerated, nor would it be the last.6 But St. Louis in 1880 appears to be unique in that the second enumeration was an attempt at a complete re-enactment; the same enumeration sheets were used in both enumerations and enumerators were expected to complete all of the census questions.7 Both enumerations also used the same June 1 reference date. The enumerator instructions for the November recount state that “enumerators will not ask the people of their district whether they have changed their residence since June 1, 1880, but they must ask, ‘Were you residents of St. Louis on the 1st of June?’ or, ‘Was St. Louis your home on the 1st of June, 1880?’ … enumerators will make no inquiries as to removals from one family to another, and from one district to another since June 1 (as suggested in my circular); but they must be very particular to ask, ‘Has any member of this family or household left the city since June 1, 1880?’ and ‘Has any person or family moved from the city from this neighborhood since June 1, 1880?’”8
The use of the June 1 reference date for the November enumeration raises a number of issues regarding the accuracy of the results. The enumeration of individuals who were present on June 1 but had subsequently left the city would depend on relatives, neighbors or landlords reporting this information to enumerators as well as giving them information on the migrants’ individual characteristics. Enumerators were always dealing with these issues, and getting fairly accurate information on absentee residents would not be an insurmountable difficulty if the respondent had some familiarity with the absentees. But the five-month gap between the reference date and the actual enumeration would make it difficult to get an exact count and precise information on relatively transient population sub-groups: extended kin and unrelated individuals in general, and those residing in hotels and larger rooming and lodging establishments more specifically. But this should not affect our ability to link the data. In contrast to records taken from two separate decennial censuses, the two enumerations of St. Louis constitute a relatively closed universe; we expect to find the same individuals living with each other. In addition, we have street addresses for both enumerations. Although some individuals would relocate (within the city) between the two enumerations, the addresses would prove useful in the linking process. The use of corroborative evidence in the form of co-resident kin and street address undoubtedly produces biased linkage results. But this issue is not important here because our goal is to link, to the extent possible, all of the records.
6 Francis A. Walker, A Compendium of the Ninth Census (Washington, D.C.: GPO, 1870), pp. xx-xxi.
7 For example, New York City and Philadelphia had recounts in 1870. In both cases enumerators were only expected to fill in a subset of the questions on the original enumerator sheets.
8 “The Census: Revised Instructions Issued to the Enumerators--One District Already Finished,” St. Louis Post Dispatch, November 9, 1880.
Our linkage approach consists of initially establishing potential links at the individual level. Names are cleaned (i.e., non-alpha characters are removed) and parsed (i.e., the given name ‘Mary E’ becomes name1=‘Mary’ and name2=‘E’). Records are blocked by sex and similarity scores based on the Jaro-Winkler algorithm are calculated for given name and surname.9 Record pairs having a surname similarity score of at least 0.9, a given name similarity score of at least 0.7, and an absolute age difference of less than five years are selected as potential links.
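The selection step can be sketched in Python as follows. This is a minimal illustration, not the project's code: the Jaro-Winkler function is the common textbook variant (prefix scaling 0.1, prefix capped at four characters), the record layout is a hypothetical dict with `given`, `surname`, `age` and `sex` keys, upper-casing names is our choice, and comparing only the first parsed given name (name1) is an assumption.

```python
import re


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)


def clean(name: str) -> str:
    """Drop non-alphabetic characters (keeping spaces) and upper-case."""
    return re.sub(r"[^A-Za-z ]", "", name).upper().strip()


def parse_given(name: str) -> tuple:
    """Split a cleaned given name into (name1, name2); name2 may be ''."""
    parts = clean(name).split()
    name1 = parts[0] if parts else ""
    name2 = parts[1] if len(parts) > 1 else ""
    return name1, name2


def is_potential_link(rec1: dict, rec2: dict) -> bool:
    """Apply the blocking and threshold rules quoted in the text."""
    if rec1["sex"] != rec2["sex"]:  # blocking by sex
        return False
    if abs(rec1["age"] - rec2["age"]) >= 5:  # age difference must be < 5
        return False
    given1, _ = parse_given(rec1["given"])
    given2, _ = parse_given(rec2["given"])
    surname_ok = jaro_winkler(clean(rec1["surname"]), clean(rec2["surname"])) >= 0.9
    given_ok = jaro_winkler(given1, given2) >= 0.7
    return surname_ok and given_ok


first = {"given": "John", "surname": "O'Donnell", "age": 43, "sex": "M"}
second = {"given": "John", "surname": "O'Donnell", "age": 46, "sex": "M"}
print(parse_given("Mary E"), is_potential_link(first, second))
```

With this variant, the surname pairs Burgherdt-Burkhart and Fitzgerald-Vetzgura round to 0.86 and 0.67, the values the paper quotes for Figure 5. Other scores reported in the paper may come from a different implementation or from name strings not shown in the text, so exact agreement should not be assumed.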
We did not standardize given names, nor did we use birthplace or race as a blocking factor. Some name standards are fairly obvious, but we decided to empirically determine the appropriate standards based on our initial links rather than impose standards based on assumptions. We hoped to use street address to facilitate the linking, but our initial attempts to link on the basis of matching street and house number information produced relatively few quality matches. In addition, we have enumeration district information, but there were 168 enumeration districts in the first enumeration compared to 450 in the second. For that reason we initially did not use district information to link records.
9 Peter Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer, 2012. http://link.springer.com/book/10.1007%2F978-3-642-31164-2.
Although we find far fewer exact or near duplicates in St. Louis than we would if we were trying to link the entire country, we nonetheless encounter a fair amount of ambiguity when looking at potential links on the individual level. Much of this ambiguity is eliminated if we take into account characteristics of co-resident family members. For example, in the first enumeration we have a ‘John O’Donnell’ who was 43 years old. Restricting the potential links to exact name matches and a maximum age difference of four years, we have three men named John O’Donnell in the second enumeration with ages of 45, 45 and 46 (see Figure 3). We know that the 43-year-old in the first enumeration is actually the 46-year-old in the second enumeration after we take into account information from other household members.
Rather than creating variables for each individual pertaining to information gleaned from co-resident kin (e.g., father’s name, father’s age, mother’s name, mother’s age, etc.) we create potential links for each individual using the simple method outlined above. Then we sum the number of potential links between specific households in the two enumerations. Using the O’Donnell example, each household member in the first enumeration has numerous links to individual records in the second enumeration. For most of these potential links, however, only one of the household members has a link to a specific household in the second enumeration. Despite the inconsistent age for John O’Donnell in the two enumerations (age 43 and 46), we know that this is the correct link after determining that his spouse and children also have potential links between these two households.
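Summing potential links by household pair amounts to a grouped count. A minimal Python sketch, assuming each potential link is a pair of record identifiers and that lookup tables map each record to its household (all identifiers here are hypothetical):

```python
from collections import Counter


def household_link_counts(potential_links, household_1, household_2):
    """Count potential links between each pair of households.

    potential_links: iterable of (record_id_1, record_id_2) pairs
    household_1 / household_2: dicts mapping a record id to its household
    id in the first / second enumeration.
    """
    counts = Counter()
    for rec1, rec2 in potential_links:
        counts[(household_1[rec1], household_2[rec2])] += 1
    return counts


# Hypothetical example: three members of household A each have a candidate
# in household B, while one member also has a stray candidate in household C.
links = [("p1", "q1"), ("p2", "q2"), ("p3", "q3"), ("p1", "x9")]
hh1 = {"p1": "A", "p2": "A", "p3": "A"}
hh2 = {"q1": "B", "q2": "B", "q3": "B", "x9": "C"}
print(household_link_counts(links, hh1, hh2).most_common())
```

The stray link to household C stands alone, while the pair (A, B) accumulates three links, which is the pattern the O’Donnell example describes.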
This process also allows us to establish links even if some of the household members do not have potential links in our initial linking pass (see Figure 4). In this household the first and third members of the two households were not in our initial potential links file because of low given name similarity (Autonia-Anton has a Jaro-Winkler similarity score of 0.699, which is below the 0.7 threshold) and excessive enumerated age difference (Annie was 19 years old in the first enumeration and 14 years old in the second enumeration). However, after determining that there are four other links between these households, we can also establish links for the records that were not initially linked on the individual basis.
We established links between households based on the following rules. First, if we have four or more potential links between specific households in the two enumerations, and each of the households with four or more potential links did not have two or more links to any other household (in the other enumeration), then we flagged it as a linked household. Second, we also accepted households with three potential links, if neither of these households had two links to any other household. Finally, we reviewed our work by visually inspecting the households with the lowest composite similarity for names and age or linked households with a majority of household members unlinked at the individual level. Using this approach we were able to link about one third of the first enumeration households; 21,214 out of 63,325 households and 99,147 out of 276,683 related individuals.
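The first two rules can be expressed as a filter over household-pair counts. The Python sketch below is our reading of those rules, not the authors' code: it treats "two links to any other household" as two or more, folds both rules into one `min_links` parameter, and uses a simple quadratic scan that would need indexing at real data sizes.

```python
def linked_households(counts, min_links=3, max_competing=1):
    """Return household pairs accepted under the counting rules.

    counts: dict mapping (household_1, household_2) to the number of
    potential links between them.
    A pair is accepted when it has at least `min_links` links and neither
    household has more than `max_competing` links to any other household.
    """
    accepted = []
    for (h1, h2), n in counts.items():
        if n < min_links:
            continue
        has_competitor = any(
            m > max_competing
            for (a, b), m in counts.items()
            if (a, b) != (h1, h2) and (a == h1 or b == h2)
        )
        if not has_competitor:
            accepted.append((h1, h2))
    return accepted


# Hypothetical counts: A1-B1 is clean, A2-B2 competes with A2-B3,
# and A3-B4 has too few links.
counts = {
    ("A1", "B1"): 4,
    ("A1", "B9"): 1,
    ("A2", "B2"): 3,
    ("A2", "B3"): 2,
    ("A3", "B4"): 2,
}
print(linked_households(counts))
```

Calling the function with `min_links=4` first and `min_links=3` second would mimic the two rules as separate passes; the visual review step has no code equivalent.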
This method only works on related individuals and will not link smaller households. However, after establishing high quality linked households, we set them aside and made additional passes through the data (discussed below). We also used the visual review process to assess why many households remained unlinked. A primary reason was lower levels of surname similarity for unlinked households, while some smaller households were unlinked because of the use of diminutives or abbreviations for given names in one enumeration or the other. We also began to explore ways to use place of residence to either verify or link households with relatively low similarity. For example, some linked households had street agreement, but their house number was off slightly (e.g., 2402 Market St. in one enumeration versus 2404 Market St. in the other). In addition, some of our initial set of household links had house number agreement, but the street name disagreed. An examination of the linked households identified many street name corrections and we were also able to construct an enumeration district translation table between the two enumerations. Although many linked households had street address disagreement, almost all of the linked households that had identical address information resided in one of a set of contiguously numbered districts in the second enumeration corresponding to a single district in the first enumeration. The correction to addresses and the use of the enumeration district equivalents allowed us to link households that had been difficult to link because of their small size or because of low surname similarity.
A second group of potential links was generated using the same thresholds used in the initial pass, except we lowered the surname threshold to a Jaro-Winkler score of 0.7 and applied some empirically-derived name standards to the given names. We then generated a second batch of household links using rules based on the number of potential links between specific households in the two enumerations. After identifying higher quality household links, the household linking rules allowed less precision if there was some evidence of residential persistence; either identical address information or similar address and residing in the same enumeration district equivalent.
Figure 5 shows two examples of linked households with surname similarity below our initial threshold of 0.9. The surname combination of Burgherdt-Burkhart generates a Jaro-Winkler score of 0.86, a level generally sufficient to establish a link if other linking variables also had relatively high similarity. And, after looking at the entire household, it is obvious that these households were correctly linked. The second household in Figure 5 is also linked, but has a surname similarity of 0.67. Here we suspect that an individual link with the surname combination of Fitzgerald-Vetzgura would be rejected by most classifiers. After looking at the household composition, however, we conclude that these are the same people. Any doubts are alleviated by looking at the household head’s occupation (“stone mason” in both enumerations) and street address (the household was enumerated at 2405 Division Street in both enumerations).
Occupational information was never explicitly used to establish links. But we began to use street address and enumeration district information to link households, and this was useful in establishing links between smaller households (especially one- and two-person households). The bottom two linked households in Figure 5 give a couple of examples. The first household has a surname similarity of 0.63, and linking is further complicated by the head’s given name (Frank vs. F.H. in the two enumerations). However, these households were enumerated at the same address, and are very likely the same people (with additional corroboration provided by the head having the occupation of ‘Retail Grocer’ in both enumerations). The second of these households has higher surname similarity (0.85) but linking is complicated by the head’s given name (Caroline vs. Catherine in the two enumerations). Although they do not have identical street information (4th Street vs. 5th Street) they do have identical house number information and were enumerated in the same enumeration district equivalents, which was enough of a tell to establish the link (we also have occupational similarity for the head’s occupation: Caroline’s listed occupation was “Keep Millinery Store” and Catherine was a “Milliner”).
The rules-based system, with its shifting thresholds and manual intervention, undoubtedly introduces bias. However, we are primarily interested in maximizing the number of links and making sure that they are correct links. Although we have not finished our work linking the related individuals, Table 1 shows that we have established links for 78 percent of the households in the first enumeration and 74 percent of the households in the second enumeration, which corresponds to 80 percent of the related individuals in the first enumeration and 76 percent of the related individuals in the second. As expected, given our household linking approach, we have more success linking households that contain more related individuals. Some of the currently unlinked households cannot be linked because the household is missing from one enumeration or the other. We anticipate, however, increasing our linkage rate through trial and error and the process of elimination. Some of the unlinked households have surname similarity below the thresholds used thus far, and we continue to modify our rules to link the smaller households. In addition, in the future we will attempt to link the unrelated population, although we suspect that many of the boarders and lodgers will be unlinkable due to the absence of corroborative information supplied by co-resident kin.
Table 2a gives the linked population’s distribution by surname similarity measures. At higher levels of similarity we would typically assume a potential link with that combination of surnames would be a true link given sufficient similarity for other linkage variables (e.g., given name, age, birthplace, and sex). This assumption begins to break down as we see less similarity in the surname combinations. Figure 6 gives examples of surname combinations from the St. Louis linked records along with the Jaro-Winkler score, phonetic codes and matched letter metrics. There is no absolute rule for deciding at what point the similarity between sets of linked surnames transitions from “plausible” to “maybe” to “doubtful.” Based on Figure 6, the transition from “maybe” to “doubtful” probably begins around 0.8 Jaro-Winkler similarity. And this means over 11 percent of our links would be treated with a fairly high level of scepticism without the corroborative information from other co-resident household members (or consistent place of residence information).
In addition to Jaro-Winkler score, Table 2a gives surname matching rates for two phonetic code algorithms, NYSIIS and double metaphone. We also construct measures indicating whether the first letter, the first two letters, and the first three letters of a surname match for our linked records. Almost half of the linked records are perfect matches, and all of these would also be considered matches using the phonetic codes and matching letters techniques. However, for linked records with non-exact matches for surname but a Jaro-Winkler score greater than 0.95, 66 percent would be a match using NYSIIS and 74 percent would be a match using double metaphone.10 Overall, 69 percent of the surname combinations of the linked population have a NYSIIS match compared to 73 percent for double metaphone. Over 93 percent of the linked surname combinations match on the first letter, with 80 and 71 percent matching on the first two and first three letters, respectively.
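The matched-letter measures are simple prefix comparisons. A short Python sketch of how such rates could be tabulated over linked surname pairs follows; the function names are ours, the comparison is case-insensitive by assumption, and the example pairs are a mix of surnames quoted in the paper and invented ones, so the printed rates say nothing about the St. Louis data. The phonetic codes (NYSIIS, double metaphone) are not implemented here.

```python
def prefix_matches(surname_1: str, surname_2: str) -> dict:
    """Whether the first 1, 2 and 3 letters of two surnames agree."""
    a, b = surname_1.upper(), surname_2.upper()
    return {
        n: len(a) >= n and len(b) >= n and a[:n] == b[:n]
        for n in (1, 2, 3)
    }


def prefix_match_rates(pairs) -> dict:
    """Share of surname pairs agreeing on the first 1, 2 and 3 letters.
    `pairs` must be a non-empty list of (surname_1, surname_2)."""
    totals = {1: 0, 2: 0, 3: 0}
    for s1, s2 in pairs:
        for n, agrees in prefix_matches(s1, s2).items():
            totals[n] += int(agrees)
    return {n: totals[n] / len(pairs) for n in totals}


pairs = [
    ("Burgherdt", "Burkhart"),   # quoted in the paper
    ("Fitzgerald", "Vetzgura"),  # quoted in the paper
    ("Felkins", "Felkin"),       # quoted in the paper
    ("Smith", "Smyth"),          # invented for illustration
]
print(prefix_match_rates(pairs))
```

A pair such as Fitzgerald-Vetzgura fails even the first-letter test, which illustrates why letter-based blocking would miss some links that household evidence confirms.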
The second panel in Table 2b shows the distribution by Jaro-Winkler score for given names. With given names, we are more concerned with standardizing abbreviations and diminutives than with whether we can match dissimilar combinations with phonetic codes. The share of linked records that have perfect given name similarity is 53 percent, with another 2.7 percent having a single initial matching the first letter of a full given name. This leaves over 44 percent of the links with less than perfect similarity. However, we constructed name standards after examining non-identical given name combinations in our linked data. In addition to the 53.8 percent of links with an exact name score, another 25 percent receive an exact score after standardization.
10 https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System; https://en.wikipedia.org/wiki/Metaphone; http://www.b-eye-network.com/view/1596; https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
The overall imprecision in given names is also driven by the fact that some of our linked records have distinctly different given names in the two enumerations. Figure 7 gives a few examples of these. The first set shows linked records that would have a given name match if we compared first names to middle names. The second set consists of examples where the given name matches a middle initial for the linked record (e.g., the “N” in “Bayard N” probably stands for “Nelson”). Nonetheless, the third set of linked records have little or no similarity in their given names, nor do they have middle initials that match a given name. Possible explanations for given name inconsistency would include changing personal preferences, respondent bias, enumerator error, and transcription error.
Table 3 gives the distribution of age precision for our linked records. If enumerators were giving a respondent’s age as of the November enumeration (rather than age on June 1st) then being a year older in the second enumeration would be considered a good or perfect match. Being a year off in the other direction would also be considered a good match if we were linking across different decennial censuses. But that would still leave over 16 percent of our linked records with an age difference of two or more years. Some respondents may not have known their true age, and their response to enumerators may have been somewhat random. Some of the imprecision is caused by respondent bias, in that co-resident kin or even neighbors might have been supplying information to a given enumerator. Transcription error would also contribute here. Regardless of the source of the error, we suspect that age differences in true links found in two different 19th century U.S. censuses would have similar (or possibly higher) rates of imprecision.11
The table also gives the somewhat surprisingly high levels of sex errors in our linked data, where almost one percent of the linked records have a sex mismatch. Although we did minimal blocking in linking the two enumerations, we did block by sex. After establishing links between households, we often have remaining unlinked related household members in the household in both enumerations. We automate a forcing procedure to link these records (if possible). We evaluated the results through clerical review, and in the process found many households with a single unlinked record in both enumerations that was very similar with the exception of a sex conflict. These records tended to be younger individuals, and often had given names that were gendered equivalents (e.g., Josephine to Joseph, Augusta to August, and Julia to Julius). It is possible that in the absence of a declaration of gender on the part of the respondent, infants and small children would not have been easily identified by the enumerator as male or female. This also reflects the oral nature of the census; enumerators recorded what they thought they had heard.
11 Peter R. Knights, “Accuracy of Age Reporting in the Manuscript Federal Census of 1850 and 1860,” Historical Methods Newsletter, Vol. 4, Issue 3, 1971. Ronald Goeken, Lap Huynh, T. A. Lynch and Rebecca Vick, “New Methods of Census Record Linking,” Historical Methods: A Journal of Quantitative and Interdisciplinary History, Vol. 44, Issue 1, January 2011.
Table 3 gives place of birth and race consistency for the linked records. The reporting of the race variable was relatively consistent, especially after taking into account inconsistency in the black and mulatto categories. Only 0.2 percent of the linked records go from white to black/mulatto (or vice versa). In contrast, over 8 percent of our linked records have mismatched birthplaces and over 18 percent have mismatches on parental birthplaces. The disagreement rate goes down quite a bit if we combine all U.S. birthplaces into a single category and do the same for the foreign born. But even using this conservative measure, 1.3 percent of our linked records have a U.S. birthplace in the first enumeration and a foreign birthplace in the second enumeration, and 1.2 percent have a foreign birthplace in the first enumeration and a U.S. birthplace in the second enumeration.
Our evaluation of linkage variable precision for the St. Louis data is preliminary, since we have not finished linking the two enumerations. The overall impression at this point is that a significant number of the linked records would not be linkable at the individual level because of low similarity. The only way we were able to link some of the households was by using address information along with the assumption that the two enumerations were a relatively closed universe.
Household Linking the Complete-Count 1870 and 1880 U.S. Censuses
We suspended the St. Louis linkage project in late 2016 (although we anticipate finalizing the
linking at some point). We initially hoped to use the St. Louis linked data to train individual-level
classifiers that we would use to link the various 19th century U.S. censuses. One reason why
this might not be a great idea is that the high levels of imprecision found in the St. Louis linked
data might not be representative of what we would find in the population of all true links found
in the decennial censuses. This is basically an issue of whether or not the two enumerations of
St. Louis were of atypically poor quality. We have no way of directly answering this question; we
suspect that overall the accuracy (or consistency) found in the 19th century U.S. censuses was
less than ideal. The relative lack of precision in the linked St. Louis data could be a worst-case
example, but it could also be what we would typically expect in enumerations of large American
cities in the 19th century.
Given concerns about using the St. Louis linked data as training data, we decided to apply the
household linking process to the complete-count decennial censuses. It was unclear how many
households we would be able to link, but we were confident that it would be a sufficient
number to train and test individual-level classifiers. We would also be able to construct false
positive estimates based on verified links (at least for the proportion of the population that we
would link and confirm via the household linking process).
The only real impediment to applying the household linking process to the complete-count
tabulations is the relative size of the databases; e.g., the United States had a population of 38
million in 1870 and 50 million in 1880. When we began work on linking the 1870 and 1880
complete count databases last fall it was taking at least a week of processing time to generate a
basic potential links file. Earlier this year, however, we made some improvements and are
currently able to generate a potential links file comparing 1870 to 1880 in about a day.
We block by sex and place of birth. We write out potential links if the difference from expected
age is less than or equal to five years and both given name and surname similarity are greater
than or equal to 0.8 (Jaro-Winkler). If the given name is an initial (in either year) and it matches
the first letter of the given name for a record in the compare year (regardless of whether it is an
initial or full name), then given name similarity is set at 0.8 (and is thus eligible to be included in
the potential links file). We also apply a relatively short list of given name standards (based on
our St. Louis household linked data).12 The outfile consists of 2.4 billion potential links.13
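The selection rule for potential links can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's software: the record layout (dictionaries with `sex`, `bpl`, `age`, `given`, `surname`), the function names, and the pairwise form of the blocking check are hypothetical, the Jaro-Winkler routine is a textbook version with a 0.1 prefix weight, and the given name standards mentioned above are left out.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if not s1 or not s2:
        return 0.0
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def given_name_similarity(given70, given80):
    """Apply the initial rule: a matching initial is scored at 0.8."""
    a = given70.strip(". ").upper()
    b = given80.strip(". ").upper()
    if a and b and (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return 0.8
    return jaro_winkler(a, b)


def is_potential_link(r70, r80, max_age_gap=5, min_sim=0.8):
    """True if an 1870/1880 record pair passes the blocking and thresholds."""
    # Blocking: only compare records sharing sex and place of birth.
    if r70["sex"] != r80["sex"] or r70["bpl"] != r80["bpl"]:
        return False
    # Expected 1880 age is the 1870 age plus ten years.
    if abs(r80["age"] - (r70["age"] + 10)) > max_age_gap:
        return False
    if given_name_similarity(r70["given"], r80["given"]) < min_sim:
        return False
    surname_sim = jaro_winkler(r70["surname"].upper(), r80["surname"].upper())
    return surname_sim >= min_sim
```

In a complete-count run the blocking would presumably be done by grouping records rather than by testing every pair; the pairwise check here only keeps the sketch short.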
At this point we are only interested in records that constitute a cluster; basically we want to
examine sets of two or more potential links between specific households in 1870 and 1880 (i.e.,
potential household links). Thus we filter out any potential link that is the sole link between
specific 1870 and 1880 households. This reduces the file to 79 million individual potential links
and 38 million 1870 and 1880 household combinations. Although the potential links file used a
0.8 surname threshold, we initially only process records that have surname similarity of at least
0.9. This further reduces the file to 48 million individual potential links and 21 million 1870 and
1880 household combinations. Most of the 1880 households that are included in the potential
links file have multiple households in 1870 as potential links (e.g., only 10 percent have a
potential link to a single household in 1870). At this stage we have ambiguity given that we are
using relatively low age and name similarity thresholds, and some birthplace blocks contain a
disproportionately large number of records (e.g., New York State, Ireland, Germany). We could
attempt to disambiguate conflicting links based on composite household age or given name
similarity, but we were fairly confident that applying rules similar to those used in linking the St.
Louis enumerations would provide a good first approximation.
12 And we do a four-way given name comparison and take the maximum value (i.e., 1. 70raw/80raw; 2. 70raw/80std; 3. 70std/80raw; 4. 70std/80std).
13 We use custom software written in Python to compare records between complete-count datasets. Development of the software considers the performance effects of four main parameters: I/O time (including network communication), compute time, memory consumption, and disk space. Our software keeps data on disk as long as possible, only pulling in data when needed and immediately writing it back out to disk at the conclusion of processing. This strategy requires many more disk reads/writes than an alternative approach that keeps data in memory, but is relatively fault-tolerant, since the data are immediately persisted to long-term storage. With extra preprocessing, use of appropriate system calls, and proper balancing between data chunk size and number of tasks, I/O time is reduced relative to compute time. Random access to the data is enabled by generating an index on the data prior to running comparisons and processing is amortized across many small tasks, several of which can run concurrently. The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported within this paper. URL: http://www.msi.umn.edu
We work from the perspective of 1880 and calculate the number of individual level links
between a specific 1880 household and 1870 households (the minimum will be a potential link
to one household in 1870 consisting of 2 individual level links). An 1880 household is
considered linked if it has at least 4 individual links to a specific 1870 household and no more
than 2 individual links to any other 1870 household. In addition, an 1880 household with 3
individual links to a specific 1870 household and no more than 1 link to any other 1870
household is linked. This initial rule establishes 1,553,420 household links consisting of
6,473,809 individual links.
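The household rule above amounts to counting individual potential links per pair of households and comparing the best count with the runner-up. A minimal sketch follows, assuming the potential links have already been reduced to (1880 household, 1870 household) identifier pairs; the function name and input shape are hypothetical.

```python
from collections import Counter, defaultdict


def link_households(potential_links):
    """Apply the first-pass household rule.

    potential_links: iterable of (hh80, hh70) pairs, one pair per
    individual-level potential link (hypothetical input shape).
    Returns a dict mapping each linked 1880 household to its 1870 household.
    """
    counts = defaultdict(Counter)
    for hh80, hh70 in potential_links:
        counts[hh80][hh70] += 1

    linked = {}
    for hh80, per_hh70 in counts.items():
        ranked = per_hh70.most_common()
        best_hh70, best = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # At least 4 links with no rival above 2, or exactly 3 with no rival above 1.
        if (best >= 4 and runner_up <= 2) or (best == 3 and runner_up <= 1):
            linked[hh80] = best_hh70
    return linked
```

Comparing only against the runner-up is enough because every other 1870 household has a count no larger than the runner-up's.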
We have no way of measuring our false positive rate. However, we can look for indirect
evidence in the form of inconsistency. Since we do not use place of residence information to
establish links, we can use the crude migration rate (defined as not living in the same state and
county in both censuses) as a proxy for the false positive rate. In other words, we expect to see
fairly consistent rates of living in the same state and county in our linked households regardless
of non-demographic characteristics. For example, age and gender are likely to have an effect on
migration behavior. But overall similarity or name commonness in our linked records should not
have a large effect on migration behavior. All things being equal, if a linked household resides in
the same state and county in both enumerations, our confidence that this is a true link
increases; a linked record that is also a non-migrant is rarely an error. However, migrants are
typically a mix of true links and false positives.14
Table 4 gives migration status for the first batch of linked households by various linkage
metrics. The top panel gives migration rates based on surname similarity. This is a household
measure (and we select the first potential link with a nuclear relationship to represent the
household). There appears to be a relationship between surname similarity and being a
non-migrant, although the range is relatively small. It is possible that migrants are less likely to
have their surnames recorded accurately or consistently, but it is also possible that we are more
likely to have false positives as surname similarity decreases (and thus higher levels of
migration for linked records with lower surname similarity indicate a higher probability of false
positives at lower levels of surname similarity).
14 This is a relative rather than an absolute rule. Some American counties have populations greater than the totals for the least populated states.
The second panel in Table 4 gives migration rates for overall record uniqueness. We construct a
uniqueness score based on the number of potential links generated by the given potential link
(which is dictated by whether a record has a relatively common combination of given and
surname, but also by the overall size of their birthplace block). We take the inverse on the
individual level, and calculate the average for the household. For example, if a given record in
1880 has only one potential link to 1870, the individual score = 1/1 (1.0). If a given record has
100 potential links in the 1870 data, then the individual score = 1/100 (0.01). Thus high
household scores indicate relative uniqueness. There appears to be a clear relationship
between lower household uniqueness scores and migration, although the range is again
relatively small. We would not expect different levels of household uniqueness to affect the
decision to move between censuses; thus the differential is indicative of higher false positive
rates as household uniqueness decreases.
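As a small worked illustration of the uniqueness score, the sketch below takes the number of 1870 potential links generated by each record in an 1880 household, inverts each count, and averages the result. The function names are hypothetical, and which household members enter the average is an assumption here (every record passed in is counted).

```python
def individual_uniqueness(n_potential_links):
    """Inverse of the number of potential links generated by one record."""
    if n_potential_links < 1:
        raise ValueError("a record in the potential links file has at least one link")
    return 1.0 / n_potential_links


def household_uniqueness(link_counts):
    """Average of the individual scores for the records of one household."""
    scores = [individual_uniqueness(n) for n in link_counts]
    return sum(scores) / len(scores)


# One record with a single potential link and one with 100 potential links.
example = household_uniqueness([1, 100])   # (1.0 + 0.01) / 2
```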
The bottom panel gives the migration rates based on how many records constitute the linked
household. Here it is possible that the differential does not indicate false positives, but rather
indicates that smaller households (and especially if they were younger couples) were in fact
more likely to move between censuses. Nonetheless, we anticipate that there are false
positives in our initial set of household links, and that Table 4 provides clues about where we
would most likely find them; household links based on the minimum number of individual links,
and those comprised of relatively common records and lower overall similarity (either overall
age or given and surname similarity).15
15 One possibility explaining the differentials in Table 4 is that migrants are more likely to live in places where the overall enumeration quality is lower; i.e., urban and frontier areas.
Table 5 gives the household linkage rate after the first round of rules-based household linking.
We only link 15 percent of all 1880 households, but most 1880 households (52 percent) are not
at risk of being linked because they contain fewer than three linkable 1880 records (with
linkable defined as having a nuclear relationship to head and being at least 10 years old in
1880). However, we link 32 percent of the eligible households and over 40 percent of the
households containing 5 or more linkable records. The table also gives household linkage rates
by race and nativity (based on the household head’s race and place of birth), with native-born
whites the most likely to be linked under the rules-based approach. We suspect that non-white
groups have lower overall precision (and possibly less stable households). The foreign-born
might have lower linkage rates because of lower overall precision (especially in the recording of
surname information), but the lower linkage rate could also be caused by the fact that some of
them were not present in the United States in 1870 (and we do not have year of immigration
information in the 19th century censuses).
Overall, the 32 percent household linkage rate is promising. And, based on our experience
linking the St. Louis data, many true household links will be found if we lower the surname
similarity threshold (for the initial pass we set the threshold at 0.9). But we also felt that many
true household links were in our current potential links universe (i.e., at the 0.9 surname level)
but remained unlinked because of ambiguity (multiple conflicting potential household links) or
because of low numbers of linkable 1880 members (three or fewer potential links in a potential
household link). And it is preferable to establish these links before we try to link households
with lower surname similarity.
Linking Households Based on Evidence of Common Neighbors
Eventually we will develop measures to identify the most similar household in cases of
ambiguity, but a quick and dirty approach would be to take the non-migrant household if there
are multiple potential household links. However, while crude non-migration works well as a
diagnostic tool, it is not always a precise linking variable. Large American cities are typically
located in a single county. In addition, for some small states a high proportion of all individuals
born in the state will reside in the largest city in that state (e.g., Boston, Massachusetts;
Providence, Rhode Island; Baltimore, Maryland). Also some ethnicities tend to cluster in large
cities. For example, linking an Irish household living in Boston in both 1870 and 1880 does not
provide definitive evidence that this is the true link.
Although we plan on continuing to experiment with the following approach, we currently
construct a measure of potential household neighbors. We have 38 million potential household
combinations in our initial potential links file and over 99 percent are the only potential
household link for the given combination of 1870 census page and 1880 census page (there are
40 lines per page in 1870 and 50 lines per page in 1880). All things being equal, the presence of
two or more potential household links on the same page combinations would increase our
confidence that these potential household links are the true links. But many neighbors will not
show up on exactly the same census page combination in the two enumerations. Typically
households enumerated ten years apart would not be enumerated in the exact sequence even
if they had not physically relocated; direct evidence of neighbors depends somewhat on
whether or not the enumerator took the same route in two different enumerations. But many
non-movers should have common neighbors in the enumerations regardless of whether or not
they show up in the same exact sequence.
Currently we calculate the number of potential household links for specific grids consisting of
ranges of images in the 1870 and 1880 data. The grid is calculated from the perspective of
specific potential household links (thus each combination of 1870 page and 1880 page will have
its own unique grid). For example, a potential household link is located on page x in 1870 and
page y in 1880. The grid (for this potential household) is defined as x plus/minus 10 (pages) in
1870 and y plus/minus 8 (pages) in 1880 (there are 40 lines per page in 1870 and 50 lines per
page in 1880; thus the grid, based on this definition, consists of a maximum of 840 records in
1870 and 850 records in 1880). And we want to know how many other potential household
links are present in a grid.
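The grid count can be illustrated with a deliberately naive sketch. It assumes each potential household link is reduced to a pair of sequential integer page (image) numbers, one for 1870 and one for 1880, and that the count includes the reference link itself, so a link that is alone in its grid scores 1. The function name and input shape are hypothetical, and the quadratic loop is only suitable for small examples; a file with millions of household combinations would need an index on page numbers.

```python
def potential_household_neighbors(links, half_width_1870=10, half_width_1880=8):
    """Count potential household links inside each link's own grid.

    links: list of (page70, page80) tuples, one per potential household link.
    Returns a list of counts in the same order; each count includes the
    reference link itself.
    """
    counts = []
    for page70, page80 in links:
        in_grid = sum(
            1
            for other70, other80 in links
            if abs(other70 - page70) <= half_width_1870
            and abs(other80 - page80) <= half_width_1880
        )
        counts.append(in_grid)
    return counts


# A grid spans 21 pages of 40 lines in 1870 and 17 pages of 50 lines in 1880.
max_records_1870 = (2 * 10 + 1) * 40   # 840
max_records_1880 = (2 * 8 + 1) * 50    # 850
```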
Table 6 gives the distribution of the potential household links by the number of potential
household neighbors (PHHN) in their respective grid. Approximately 59 percent of the time the
specific potential household link will be the only potential household link in the grid (i.e., PHHN
= 1). Some of these could be true links (if the household physically moved between censuses
and thus does not have any common neighbors), but we suspect that most are false links. The
right side of the table gives the PHHN distribution for the rules-based links. The table also gives
the relationship between PHHN and migration status for our first batch of linked households;
over 70 percent of our initial household links are migrants if they are the only potential link in
their grid. As grid count increases, the household links are increasingly non-migrants.16
Figure 8 shows the potential household links contained in a single grid. The reference
household is highlighted (the “Turks”), and this is the only potential household link on the
specific combination of 1870 page and 1880 page. Their grid is defined as 1870 page +/- 10
pages and 1880 page +/- 8 pages, and there are 12 other potential household links in this grid;
thus the PHHN for the reference potential household link (the Turks) is 13 (other potential
households in the figure will have different values for PHHN because the grid moves as we
calculate PHHN for other combinations of pages). The figure does not contain page information,
but it does contain household serial number information. The serials for both years are set to
zero for the reference household in the example (the Turks), with the values for other potential
households equal to the difference between their actual household serial and the actual
household serial for the reference household. For example, the Kime household has a
serial80 diff = 2, meaning there was one household located between the Turk household and
the Kime household in 1880. For 1870 the value is -10, meaning there were nine households
between the Turk household and the Kime household in 1870.
A high value for PHHN typically indicates the true household link, but we initially expected some
potential household links to have relatively high values but still be a false link. Thus we combine
the PHHN with the household uniqueness score discussed earlier. The average uniqueness
score for a household ranges from 0 to 1.0, which we convert to an integer (i.e., 1 to 100).
Combo score is the product of PHHN and the household uniqueness score. Using Figure 8 as an
example, the range of PHHN is 10 to 27, the range of uniqueness score is 2 to 40, and the range
for combo score is 26 to 1040.
16 And the small percentage of potential household links that have high PHHN and are also a migrant are apparently residents of counties that experienced boundary changes between 1870 and 1880.
Without much experimentation we decided to create another batch of linked households based
on the combo score. We also decided to include smaller households (i.e., potential households
with only two potential links) in the eligible universe. Thus any 1880 household not linked in the
first pass (rule-based) that has at least two potential links is eligible. If the potential
household link has the maximum number of individual potential links for that household, and
the potential household has a combo score of at least 100, we consider it linked. Figure 8 shows
how this rule affects the households in this grid. Five of nine households that were initially
unlinked are now linked. In addition, it seems that the current combo score threshold is too
conservative; all of the remaining unlinked households appear to be true household links.
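The combo score rule can be sketched as follows. Several details are assumptions rather than statements of the authors' procedure: the conversion of the 0 to 1.0 uniqueness score to an integer is taken to be rounding after multiplying by 100 (clamped to 1 through 100), ties between candidate 1870 households are left unlinked, and the field names (`hh70`, `n_links`, `phhn`, `uniqueness`) are hypothetical.

```python
def uniqueness_to_int(score):
    """Rescale a 0-1.0 household uniqueness score to an integer from 1 to 100."""
    return max(1, min(100, round(score * 100)))


def combo_score(phhn, uniqueness):
    """Product of potential household neighbors and the integer uniqueness score."""
    return phhn * uniqueness_to_int(uniqueness)


def second_pass_link(candidates, threshold=100):
    """Pick the 1870 household for one unlinked 1880 household, if any.

    candidates: list of dicts, one per potential 1870 household, with keys
    'hh70', 'n_links', 'phhn' and 'uniqueness' (hypothetical field names).
    Returns the chosen 1870 household identifier, or None.
    """
    eligible = [c for c in candidates if c["n_links"] >= 2]
    if not eligible:
        return None
    most_links = max(c["n_links"] for c in eligible)
    winners = [
        c for c in eligible
        if c["n_links"] == most_links
        and combo_score(c["phhn"], c["uniqueness"]) >= threshold
    ]
    # Assumption: ambiguous cases (more than one winner) stay unlinked.
    return winners[0]["hh70"] if len(winners) == 1 else None
```

Raising or lowering `threshold` is the natural place to experiment with the calibration discussed in the text.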
Again, this first pass only used potential links above 0.9 surname J-W, and our original potential
links file contains potential links down to the 0.8 surname level. After flagging linked
households from the 0.9 level (both the rules-based linked households and the household links
based on combo score), we set them aside and include all records from currently unlinked
households and repeat the process. Table 7 gives the number of households linked at the end
of the 0.8 surname level pass (8 categories); 2.4 million linked households consisting of over 9
million individual links.
Table 7 also gives the non-migration rates for the 8 categories of household links. However,
since we used the presence of common neighbors to establish 6 of the 8 categories of linked
households, the non-migration rate is not an indication of consistency (at least not as a
comparison to the categories of household links (i.e., rules-based) where we did not use the
presence of common neighbors to establish the link). A comparison of the 1st category
(rules-based household links using a 0.9 threshold for surname) to the 5th category (rules-based
household links using a 0.8 threshold for surname) shows that the latter category does have a
lower rate of non-migration, which could be indicative of a higher rate of false positives. Table 8
replicates the diagnostics shown earlier in Table 4 (which used the 0.9 surname threshold,
rules-based household links). In general the 2nd batch of rules-based household links have lower
rates of non-migration compared to the same categories in Table 4, but overall the range for
the 0.8 threshold (rules-based) household links is similar to what we found for the 0.9 threshold
(rules-based) household links.
The top panel in Table 9 shows the household linkage rate for all 1880 households by the
number of 1880 linkable records. In contrast to Table 5, where we only included the first batch
of rules-based household links (using the 0.9 surname threshold), this version includes all of our
current household links. Our overall linkage rate is now over 24 percent, although the linkage
rate remains quite a bit lower for the smaller households. The bottom panel of the table
restricts the universe to 1880 households at risk of being linked and gives the household linkage
rate by race and nativity. Since we eventually were willing to link 1880 households with two
linkable records, the only 1880 households not in the linkable universe are the 1880 households
that only contain one linkable record. The linkage rate for 1880 linkable households is 26.3
percent, which is lower than the comparable figure in Table 5 (which was 32.4 percent). But the
linkable household universe here is inflated by the inclusion of 1880 households containing only
two linkable records (which make up almost half of the 1880 households, but are only linked 7
percent of the time). And we suspect that many of the households containing only two linkable
records did not exist in 1870 (i.e., younger married couples).
Table 9 gave the number of individual potential links contained in our current batch of linked
households. However, this underestimates the number of true links in the linked households;
similar to what we found in our St. Louis linked households, we have many currently unlinked
records in our linked households that appear to be the true link. Figure 9 shows a few examples
of linked households. In the first example we establish the linked household based on the
household head and spouse in 1880 (W. N. and Sarah Ann) and one of their children (Ida).
However, there are other children in the 1880 household who were also present in the
household in 1870. But we were unable to establish these links at the individual level because
of birthplace inconsistency (John and Walter had missing birthplace information in 1870, while
Howard was recorded as born in Iowa in 1870 and in Illinois in 1880) and low given name
similarity (Cora vs. Carrie for the daughter). And we can assume that eight-year-old Willie in the
1880 household was not yet born in 1870.
The second example shows a household with four explicit links. The three unlinked members in
the 1880 household also appear to be in the 1870 household but were unlinked because of
excessive differences in expected age (the head was age 28 in 1870 and age 46 in 1880, while
the spouse was age 25 in 1870 and age 43 ten years later) and given name (Ann E. vs Analiscia).
And we assume the two other members of the 1880 household were not present in 1870
(Minnie I. was 9 years old in 1880 and was probably unborn in 1870 and John Peterman was a
21-year-old unrelated individual in 1880).
The households in the third example contain five individuals explicitly linked. We were unable
to link Elwood C. at the individual level because he/she was enumerated as a male in 1870 and
as a female in 1880. Despite the name difference, we are fairly confident that 0-year-old
Rosetta J. in 1870 is actually 10-year-old Josephine R. in 1880. It is also possible that 21-year-old
Minerva in 1870 is 29-year-old Louiza J. in 1880. But in contrast to Rosetta J.-Josephine R.,
where transposing first and middle names results in similarity, there is no obvious commonness
between the names Minerva and Louiza J.
These examples are not strictly representative, but demonstrate that many of our linked
households in 1880 contain unlinked records that also have their true link in the 1870
household. In general, if we establish a linked household, then we expect unlinked records with
a nuclear relationship (i.e., head, spouse or child) and age greater than or equal to 10 to also be
present in the 1870 household. There are categories where this assumption is less likely to be
true. For example, an older child in 1880 might have already left home at the time of the 1870
census despite being present for the 1880 enumeration. The youngest linkable children in
1880 (ten- or even eleven-year-olds, for example) might actually have not been born at the
time of the 1870 census (and some of the nine- or even eight-year-olds in 1880 were actually
alive in 1870). Spouses with low age or name similarity could be indicative of second marriages.
Given these exceptions to our general assumptions about co-residential persistence, we initially
adopted a fairly conservative approach to forcing linkages between records with low similarity
for key linkage variables.
We will eventually develop a more nuanced approach to deal with this complex problem, but
for this paper we adopted a simple procedure based on our household linking rules. First we
drop all thresholds, and compare all unlinked household members from the 1870 household to
all linkable members of the 1880 household (i.e., we block by household and exclude 1880
records younger than ten and those with a non-nuclear relationship to head). We award one
point for each of the following: same sex, same birthplace, age within 4 years of expected age,
and given name similarity greater than 0.9. Using an example from Figure 9, Elwood C in 1880
would get three points for the comparison to Elwood C in 1870 (one point each for given name,
age, and birthplace, but not for sex, for a total of three points). The maximum number of
points for the forcing procedure is three points (because all of these records failed to link
initially because of low similarity or mismatch in at least one of the key linkage variables). If a
comparison gets three points, and no other comparison gets at least three points, then we
force the link.
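The point system lends itself to a short sketch. The version below is an illustration under several assumptions: records are dictionaries with hypothetical keys (`sex`, `bpl`, `age`, `given`), the name similarity is a stand-in built on Python's `difflib.SequenceMatcher` rather than Jaro-Winkler, the inputs are assumed to contain only still-unlinked records, and "no other comparison" is read as "no other comparison involving either of the two records".

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Stand-in string similarity (not Jaro-Winkler), from 0.0 to 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def forcing_points(r70, r80):
    """One point each for sex, birthplace, expected age and given name."""
    expected_age_1880 = r70["age"] + 10
    return (
        int(r70["sex"] == r80["sex"])
        + int(r70["bpl"] == r80["bpl"])
        + int(abs(r80["age"] - expected_age_1880) <= 4)
        + int(name_similarity(r70["given"], r80["given"]) > 0.9)
    )


def force_links(unlinked70, linkable80):
    """Return (index70, index80) pairs forced within one linked household."""
    points = {
        (i, j): forcing_points(a, b)
        for i, a in enumerate(unlinked70)
        for j, b in enumerate(linkable80)
    }
    forced = []
    for (i, j), score in points.items():
        if score < 3:
            continue
        # A rival is any other comparison with 3+ points sharing a record.
        has_rival = any(
            other != (i, j) and other_score >= 3 and (other[0] == i or other[1] == j)
            for other, other_score in points.items()
        )
        if not has_rival:
            forced.append((i, j))
    return forced
```

With this reading, a household containing two same-age, same-sex, same-birthplace candidates on each side produces no forced links, because every comparison has a rival.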
Figure 10 shows the forced linking procedure applied to the households from Figure 9. Despite
failures to initially link at the individual level, all of the forced links look highly probable with the
exception of Louiza J. to Minerva in the Miller household, but even here we would assume that
there is a possibility that Louiza J is actually Minerva. The forcing procedure establishes links for
1,183,892 records, or about 71 percent of the unlinked but linkable 1880 records. Some of the
current forced links are errors, but we anticipate refining the approach to address the issue of
false positives. But it also appears that many of the linkable but still unlinked 1880 records do
have their true link residing in the 1870 household. In the first example in Figure 11 we have
one unlinked record in the 1880 household, 21-year-old John W., who is probably 11-year-old
Walker in the 1870 household. In addition to the low similarity between the given names, the
two records have mismatched birthplaces. The second household in Figure 8 shows an extreme
example of ambiguity in the forcing process. The 1870 household contains two 13-year-old
males with given names of Abda F and Abba F. Despite the presence of two males in the 1880
household who were 23 years old, the forcing procedure cannot determine the correct link (i.e.,
because either could be Felix or Festus in the 1880 household).
A review of our forced links discloses that low given name similarity was the primary reason
records were not linked as part of the initial household linking process. We anticipate improving
our given name standardization process, which would increase the given name similarity for
some of these records (and thus increase the probability that these records will be compared
to their true link at the individual level). But as seen in previous examples, many true links with
low given name similarity were enumerated with distinctly different given names in the two
enumerations. We have 41,472 males with the given name of Henry in 1880 in the group of
forced links. Approximately 45 percent also had a given name of Henry in 1870, with a much
smaller percentage having names or variants that could be standardized as Henry (like Harry or
Harvey). But most have given names that are definitely not Henry. For example, we have 1,714
Henry-William combinations and almost 40 percent of the Williams have a middle initial of ‘H’
in 1870. Many of the forced links that have low given name similarity also have a middle initial
that increases confidence in the link, but a majority do not have middle name or initial
information.
Although we have not finished developing a comprehensive approach to the household linking
process, we have begun to assess the range of precision for our key linkage variables. Tables 10
and 11 give the range of imprecision for our current linked data, which includes both explicit
and forced links. In general, precision levels are higher for our complete-count household links
compared to the St. Louis household links (see Tables 2 and 3). However, the ability to make
strict comparisons is limited by a number of factors. For example, approximately 11 percent of
our complete-count household links have surname similarity below the 0.9 level. The
comparable figure for the St. Louis household links was 28 percent. We expect the proportion
of complete-count links where this is true to increase as we lower our surname similarity
threshold in the potential link selection process; i.e., some of the currently unlinked households
are unlinked precisely because all household members have low surname similarity to their true
links.17 The relatively closed universe of the two enumerations of St. Louis, along with the
availability of street and house number information, allowed us to link smaller households or
households with low levels of similarity; in other words, we were able to get closer to the
bottom of the barrel than we will ever be able to do with households enumerated 10 years
apart.18
17 And this same logic would apply to the higher levels of imprecision for place of birth in the linked St. Louis data; we blocked by place of birth in constructing the complete-count individual level links; we suspect that some of the currently unlinked households are unlinked because most or all household members have mismatched birthplace information.
18 It is also possible that the first enumeration of St. Louis was an example of a shoddily taken census, while the second enumeration (which used a reference date five months prior to the date of the recount) introduced imprecision in recording information for individuals who had left the city. A more charitable interpretation would be that imprecision found in St. Louis in 1880 would be representative of enumerations in large American cities in the nineteenth century, and that we would expect greater precision for individuals enumerated in small towns and rural areas. Whether or not the imprecision in the linked St. Louis data is an outlier is an interesting issue, but we nonetheless also find relatively high imprecision in the complete-count linked data.
Going Forward
Our current linkage project will eventually include links covering the 1850, 1860, 1870, and
1880 complete-count census databases. Based on our initial results, we are fairly confident that
we will link a fairly sizable proportion of 1880 records to all three of the previous decennial
censuses using the household linking approach (year of birth permitting). Going forward, some
of our work will focus on better methods of identifying and eliminating false positives. The use
of additional evidence derived from common neighbors and co-resident kin implies that we
have a higher standard; our (unachievable) goal is to never make an incorrect household link.
Quality control can be tedious (and demoralizing when it uncovers a logical flaw or two) but it is
a necessary part of the process. And we will continue to evaluate quality issues as we proceed
to create additional household links in the 1870-1880 data. Some households will never be
linked, but we hope to ultimately double our current household linkage rate. Some of our
optimism is based on our experience with St. Louis; although we already saw diminishing
returns in our second pass using a lower surname threshold, we anticipate finding additional
household links below a 0.8 surname similarity threshold. We also suspect a significant number
of households remain unlinked because of birthplace inconsistency. Analysis of our forced
links (often forced due to low given name similarity) will result in improved given name
standardizations (or aliases). We will also refine our measurement of household uniqueness
and neighbor calculations. The PHHN (i.e., common neighbors) approach needs some
calibration, but promises to link many additional households.
Although we anticipate continuing to find households based on the process of elimination,
some households will remain unlinked because they did not exist in the previous census. A
common example would be older sons in the 1870 census who leave home and get married;
thus they will be living with a spouse and children under the age of 10 in the 1880 census.
However, if we can link their household of origin in 1870 to an 1880 household and verify that
they were absent from that 1880 linked household, then we are more confident in creating a
household link absent the presence of any corroborative kin. Figure 12 gives an example based
on the grid example (i.e., Figure 8). Figure 12 gives the entire 1870 and 1880 households for the
Mathis household, and we can see that the four oldest sons in the 1870 household were not
present when the household was enumerated in 1880. Although this household was not the
reference point for this specific grid, we can identify what appears to be two of the absent sons
(with their wives and children) in this grid, and they are located in close proximity to the 1880
household that contains their parents and younger siblings. We do not know how many of
these types of households we will be able to link, but we believe the use of common neighbor
information greatly expands our ability to confidently verify linkage decisions.
The household links will be useful for some types of analysis (e.g., where the relevant unit of
study consists of married couples or related groups) but they will definitely be biased. But we
also anticipate continuing to construct individual (minimal bias) level links. Here the household
links can be used in two primary ways. They can be used as a verification set for links
established at the individual level. And the household links are an important part of this process
because of the presence of the forced links (i.e., links not initially present in our potential links
file, typically because of low similarity or mismatch in at least one linkage variable). For the
most part, these records will rarely be linked by individual-level classifiers. An accurate
estimation of the false positive rate requires establishing all true links (despite low similarity or
mismatched linkage variables) and the only way to do this is to use a maximum amount of
information (i.e., the household linkage process).
One issue with the household links as a verification set is that they will not cover the entire
population of individual-level links. This is true (i.e., some individual level links will not be
verified because we did not link their household), but we suspect that our individual-level links
will contain a disproportionately high number of links established at the household level. This is
because the inability to be linked at the household level implies a number of conditions or
characteristics at the individual level.
We would (theoretically) expect similar levels of linkage variable precision for some groups of
individuals not linked at the household level (compared to those linked at the household level).
This would include the sons who transition to marriage and the establishment of their own
households between censuses. This would also include households with relatively common
names (combined with large birthplace blocks) that remain unlinked because of ambiguity
(especially if they lack common neighbors in the two censuses). These two groups should have
overall precision comparable to the household linked set (although ambiguous records at the
household level will also be ambiguous at the individual level).
But many members of the household linking resistance are harder cases. Under-enumeration in
the 19th century was fairly high (possibly as high as five percent). We also have some 1880
households that were not in the country in 1870 (and from a record linkage perspective they
are similar to under-enumerated records). We are still in the speculative stage, but in addition
to households missing from one enumeration or the other, it seems plausible that at least that
many are underwater (i.e., we will never be able to link them even at the household level
because of low similarity or mismatch for one or more linkage variables). Less extreme, but still
problematic, is the sizable 19th century unrelated population. Rarely will they be co-resident
with the same people in both censuses. And the accuracy of their names, age and birthplace
will undoubtedly vary, but we suspect that the quality of information for unrelated individuals
enumerated in the 19th century is relatively poor.
So the part of the 1880 population that is not part of the household linked universe will consist
of a higher proportion of records that either do not have a true link or have a relatively low
similarity true link. It is possible that an individual-level classifier trained and tested on the
household links (and calibrated to get an optimal combination of linkage and false positive
rates) will not perform nearly as well on the set of records that were not linked at the
household level (primarily because this universe contains many records without true links, and
some of these records get linked randomly at lower levels of classifier-approved thresholds).
But maybe this really does not matter. It is possible that a well-designed individual-level
classifier achieves “acceptably” low false positive rates, in that the presence of some incorrectly
linked records does not significantly affect research results. This has been the standard default
position for previous linkage projects, but it has mostly been based on speculative optimism
(i.e., faith-based record linkage). Ultimately we hope to produce a fairly comprehensive set of
verified household links for 1850 through 1880. We will also produce linked data at the
individual level. Thus we will have three different linked sets: household links; individual-level
links; and individual-level links with the false positives removed (i.e., false positives identified by
comparing the individual-level links to the household links). We plan on experimenting with
different types of analysis (e.g., female labor-force participation, socioeconomic mobility, etc.)
to see if we get different results based on which linked set we use.
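The comparison behind the third linked set (individual-level links with the false positives removed) can be sketched as follows. This is a minimal illustration rather than the project's procedure: the record identifiers, the function name `partition_individual_links`, and the rule that an individual-level link counts as a false positive whenever one of its records is assigned to a different partner by a verified household link are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# A link pairs a record id from the earlier census with one from the later census.
Link = Tuple[str, str]


def partition_individual_links(
    individual_links: Set[Link], household_links: Set[Link]
) -> Dict[str, Set[Link]]:
    """Split individual-level links by how they compare to verified household links."""
    early_in_household = {early for early, _ in household_links}
    late_in_household = {late for _, late in household_links}

    confirmed: Set[Link] = set()
    false_positive: Set[Link] = set()
    unverified: Set[Link] = set()
    for early, late in individual_links:
        if (early, late) in household_links:
            # The household linking agrees with the individual-level link.
            confirmed.add((early, late))
        elif early in early_in_household or late in late_in_household:
            # Assumed rule: a record already placed by a household link
            # cannot also be linked to a different partner.
            false_positive.add((early, late))
        else:
            # Neither record belongs to a linked household, so nothing can be said.
            unverified.add((early, late))

    return {
        "confirmed": confirmed,
        "false_positive": false_positive,
        "unverified": unverified,
        "cleaned": confirmed | unverified,  # individual links minus false positives
    }


# Hypothetical record ids, for illustration only.
household = {("a70", "a80"), ("b70", "b80")}
individual = {("a70", "a80"), ("b70", "x80"), ("c70", "c80")}
result = partition_individual_links(individual, household)
```

Under these assumptions the "unverified" links are exactly the ones the text worries about: they cannot be checked against a household link, so their error rate has to be inferred rather than observed.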
Figure 1. Primary and Secondary Links, 1870-1880 Male-Only Sample
linktype   fname70    lname70  age70  relate70  fname80    lname80  age80  relate80
unlinked   JOHN       MCHUGH   50     head
unlinked   REBECCA    MCHUGH   37     spouse
primary    HENRY      MCHUGH   14     child     HENRY      MCHUGH   25     child
unlinked   JAMESE     MCHUGH   3      child
unlinked   JANER      MCHUGH   0      child
unlinked                                        CATHARINE  MCHUGH   64     head
unlinked                                        ELLEN      MCHUGH   38     child
unlinked                                        EDWARD     MCHUGH   35     child
unlinked                                        MARYF.     MCHUGH   27     child
unlinked                                        MARYE.     MCHUGH   16     grandchild
unlinked                                        EDWARDJ.   MCHUGH   12     grandchild

linktype   fname70    lname70  age70  relate70  fname80    lname80  age80  relate80
primary    JAMES      FELKINS  61     head      JAMESH.    FELKIN   71     head
unlinked   MARTHA     FELKINS  53     spouse
secondary  NANCY      FELKINS  35     child     NANCY      FELKIN   42     child
unlinked   BUNELL     FELKINS  28     child
secondary  ELISABETH  FELKINS  16     child     ELISIBETH  FELKIN   23     child
secondary  PAIKNEY    FELKINS  14     child     PINKNY     FELKIN   22     child
unlinked                                        MATILDA    FELKIN   67     spouse
Notes: fname70 = first name in 1870; lname70 = last name in 1870; age70 = age in 1870; relate70 = imputed relationship to head in 1870; fname80 = first name in 1880; lname80 = last name in 1880; age80 = age in 1880; relate80 = imputed relationship to head in 1880.
Figure 2. 1850 Slave Schedule
Figure 3a. Potential matches for John O’Donnell, St. Louis 1880
fname1  lname1     age1  fname2  lname2     age2
JOHN    O'DONNELL  43    JOHN    O'DONNELL  45
                         JOHN    O'DONNELL  45
                         JOHN    O'DONNELL  46
Figure 3b. Households containing potential links for John O’Donnell, St. Louis 1880
fname1   lname1     age1  fname2     lname2     age2  p_link  sum_p_link
JOHN     O'DONNELL  43    JOHN       O'DONNELL  46    1       5
MARY     O'DONNELL  43    MARY       O'DONNELL  44    1       5
MICHAEL  O'DONNELL  15    MICHAEL    O'DONNELL  16    1       5
PATRICK  O'DONNELL  9     PATRICK    O'DONNELL  9     1       5
BRIDGET  O'DONNELL  6     BRIDGET    O'DONNELL  5     1       5

JOHN     O'DONNELL  43    JOHN       O'DONNELL  45    1       1
MARY     O'DONNELL  43    ELLEN      O'DONNELL  40    0       1
MICHAEL  O'DONNELL  15    JULIA      O'DONNELL  12    0       1
PATRICK  O'DONNELL  9                                 0       1
BRIDGET  O'DONNELL  6                                 0       1

JOHN     O'DONNELL  43    JOHN       O'DONNELL  45    1       1
MARY     O'DONNELL  43    MARGRET    O'DONNELL  39    0       1
MICHAEL  O'DONNELL  15    JOHN       O'DONNELL  19    0       1
PATRICK  O'DONNELL  9     ELIZEBETH  O'DONNELL  14    0       1
BRIDGET  O'DONNELL  6     FRANCIS    O'DONNELL  12    0       1
                          WILLIAM    O'DONNELL  4     0       1
Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration; p_link = indicates a potential link between individuals listed; sum_p_link = the sum of potential links between specific households.
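The way sum_p_link separates the three candidate households in Figure 3b can be sketched in a few lines. The matching rule below (identical first name, an age gap of at most three years, and each second-enumeration record used at most once) is an assumption chosen so that the example reproduces the figure; it is not the project's actual rule set, and the names `is_potential_link` and `sum_p_link` are hypothetical.

```python
from typing import List, Tuple

Person = Tuple[str, int]  # (first name, age); the surname is shared by all records here


def is_potential_link(p1: Person, p2: Person, max_age_gap: int = 3) -> bool:
    """Assumed rule: same first name and ages within max_age_gap years."""
    return p1[0] == p2[0] and abs(p1[1] - p2[1]) <= max_age_gap


def sum_p_link(hh1: List[Person], hh2: List[Person], max_age_gap: int = 3) -> int:
    """Count members of hh1 that have a potential link in hh2 (one-to-one)."""
    used = set()
    total = 0
    for p1 in hh1:
        for j, p2 in enumerate(hh2):
            if j not in used and is_potential_link(p1, p2, max_age_gap):
                used.add(j)
                total += 1
                break
    return total


# First-enumeration household and the three candidates, copied from Figure 3b.
first = [("JOHN", 43), ("MARY", 43), ("MICHAEL", 15), ("PATRICK", 9), ("BRIDGET", 6)]
candidates = [
    [("JOHN", 46), ("MARY", 44), ("MICHAEL", 16), ("PATRICK", 9), ("BRIDGET", 5)],
    [("JOHN", 45), ("ELLEN", 40), ("JULIA", 12)],
    [("JOHN", 45), ("MARGRET", 39), ("JOHN", 19), ("ELIZEBETH", 14),
     ("FRANCIS", 12), ("WILLIAM", 4)],
]
scores = [sum_p_link(first, hh) for hh in candidates]
best = max(range(len(candidates)), key=lambda i: scores[i])
```

All three candidates contain a plausible John O'Donnell, so the individual record alone is ambiguous; only the household-level sum singles out the first candidate.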
Figure 4. A linked household, St. Louis 1880
fname1     lname1    age1  fname2  lname2  age2  p_link  sum_p_link  J-W fname  J-W lname
AUTONIA    STROUBEL  52    ANTON   STRUBE  53    0       4           0.69       0.94
ELIZABETH  STROUBEL  42    ELIZA   STRUBE  42    1       4           0.91       0.94
ANNIE      STROUBEL  19    ANNIE   STRUBE  14    0       4           1.00       0.94
MINNIE     STROUBEL  12    MINNIE  STRUBE  13    1       4           1.00       0.94
LOUISA     STROUBEL  10    LOUISE  STRUBE  11    1       4           0.93       0.94
DORETTA    STROUBEL  4     DORA    STRUBE  5     1       4           0.90       0.94
Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration; p_link = indicates a potential link between individuals listed; sum_p_link = the sum of potential links between specific households; J-W fname = Jaro-Winkler similarity score for first name strings; J-W lname = Jaro-Winkler similarity score for last name strings.
Figure 5. Linked households, St. Louis 1880
fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
MATHEW     BURGHERDT   40    MATHEW     BURKHART  47    1.00       0.86
ELIZABETH  BURGHERDT   40    ELIZABETH  BURKHART  40    1.00       0.86
CATHERINE  BURGHERDT   12    KATE       BURKHART  11    0.69       0.86
ELIZABETH  BURGHERDT   9     ELIZABETH  BURKHART  9     1.00       0.86
WILLIAM    BURGHERDT   4     WILLIAM    BURKHART  4     1.00       0.86

fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
DAVID      FITZGERALD  48    DAVE       VETZGURA  45    0.85       0.67
MARY       FITZGERALD  34    MARY       VETZGURA  36    1.00       0.67
ANNIE      FITZGERALD  12    ANNA       VETZGURA  12    0.85       0.67
KATE       FITZGERALD  10    KATE       VETZGURA  11    1.00       0.67
ANDREW     FITZGERALD  5     ANDREW     VETZGURA  6     1.00       0.67
NORA       FITZGERALD  2     MONORA     VETZGURA  3     0.81       0.67
RICHARD    FITZGERALD  0     RICHARD    VETZGURA  0     1.00       0.67

fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
FRANK      KLAESER     60    F.H.       CLASSEN   60    0.76       0.63
BRIDGET    KLAESER     56    BRIDGET    CLASSEN   58    1.00       0.63

fname1     lname1      age1  fname2     lname2    age2  fname J-W  lname J-W
CAROLINE   SCHWARTZ    60    CATHERINE  SCHMARG   60    0.76       0.85
AUGUSTA    SCHWARTZ    26    AUGUSTE    SCHMARG   25    0.94       0.85
Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration; fname J-W = Jaro-Winkler similarity score for first name strings; lname J-W = Jaro-Winkler similarity score for last name strings.
Figure 6. Selected surname combinations in the linked data, St. Louis 1880
lname1      lname2       J-W   NYSIIS  Doublemeta  match1  match2  match3
COBB        COBBS        0.96  1       1           1       1       1
MAIER       MIER         0.94  1       1           1       0       0
BLOCH       BLOCK        0.92  0       1           1       1       1
SCHLEGEL    SCHLAEGD     0.90  0       0           1       1       1
KAMPF       KEMPF        0.88  1       1           1       0       0
LAMPE       LAMPKING     0.86  0       0           1       1       1
NOOTEN      NEWTON       0.84  1       1           1       0       0
BORGERS     BORSGUS      0.82  0       0           1       1       1
GERRAN      GUERIN       0.80  1       1           1       0       0
THORNALLY   TOMALLI      0.78  0       0           1       0       0
BOETTE      BOOTH        0.76  0       0           1       1       0
BROCHRIGT   BROOKLINE    0.74  0       0           1       1       1
HEFFNER     HOFFMANN     0.72  0       0           1       0       0
RUBIN       LUBIER       0.70  0       0           0       0       0
GOTTMAYER   KOLMEYER     0.66  0       0           0       0       0
THOMA       TGNAZ        0.64  0       0           1       0       0
BOICE       NOYES        0.60  0       0           0       0       0
KOOKENBERG  GUEGGESBERY  0.55  0       0           0       0       0
KEEVIL      DRISCOLL     0.53  0       0           0       0       0
Notes: 0/1 indicates that the name combination would not match/match for phonetic/matching codes; J-W = Jaro-Winkler similarity score for last name combination; NYSIIS = whether the name combination has a NYSIIS match; Doublemeta = whether the name combination has a double metaphone match; Match1 = whether the name combination matches on first letter; Match2 = whether the name combination matches on first 2 letters; Match3 = whether the name combination matches on first 3 letters.
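The J-W and Match1 to Match3 columns of Figure 6 can be illustrated with a short sketch. It uses the textbook Jaro-Winkler definition (prefix scale 0.1, at most four prefix characters); the paper does not say which implementation was used, so these parameters are assumptions, although they do reproduce the first three rows of the figure. The NYSIIS and double metaphone columns are not reproduced here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed standard parameters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def prefix_match(s1: str, s2: str, n: int) -> int:
    """1 if the first n letters agree, else 0 (the Match1/Match2/Match3 columns)."""
    return int(s1[:n] == s2[:n])


# Three surname pairs copied from Figure 6.
pairs = [("COBB", "COBBS"), ("MAIER", "MIER"), ("BLOCH", "BLOCK")]
scores = {
    pair: (round(jaro_winkler(*pair), 2), [prefix_match(*pair, n) for n in (1, 2, 3)])
    for pair in pairs
}
```

MAIER/MIER shows why the prefix flags are a blunt instrument: the pair has a high similarity score and phonetic agreement, yet it would be dropped by any blocking scheme that requires the first two letters to agree.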
Figure 7. Examples of first name mismatches, St. Louis 1880
fname1    lname1       age1  fname2     lname2      age2
C.ALBERT  RAHNER       24    ALBERT     RAHNER      24
BERNARD   HILL         23    C.BERNARD  HILL        22
BRIDGET   CARTEN       33    M.BRIDGET  CARTEN      34
C.AMELIA  SHEERER      32    AMALIAC.   SHERER      35

fname1    lname1       age1  fname2     lname2      age2
BAYARDN.  ABBOTT       4     NELSON     ABBOTT      3
BELLE     HILTON       5     IDAB.      HILTON      5
DAVID     SUTTMUELLER  47    JOHND.     SULTMULLER  48
ELLEN     ROBINS       2     MARYE.     ROBBINS     3

fname1    lname1       age1  fname2     lname2      age2
THECKLA   NIEHAUS      57    MARY       NIEHAUS     57
TIMOTHY   LYNCH        17    BUD        LYNCH       18
LILLY     WALSER       0     GRACE      WALSER      0
WILLIAM   PERRIN       0     EUGENE     PERRIN      0
Notes: fname1 = first name in first enumeration; lname1 = last name in first enumeration; age1 = age in first enumeration; fname2 = first name in second enumeration; lname2 = last name in second enumeration; age2 = age in second enumeration.
Figure 8. Sample Neighbor Grid, Livingston County, Illinois, 1870-1880 Complete Count
rules only /
rules plus neighbors  serial70diff  serial80diff  fname70     lname70     age70  fname80    lname80      age80  neighbor count (PHHN)  unique score  combo score
                      -12           -67           MARY        WOODRUFF    34     MARY       WOODRUFF     43     10                     5             50
                      -12           -67           ALPHONSO    WOODRUFF    15     ALPHONSO   WOODRUFF     25     10                     5             50
linked ***            -19           -66           JOHN        ARNOLD      52     JOHN       ARNOLD       63     10                     6             60
linked ***            -19           -66           LOUISA      ARNOLD      50     LOUISA     ARNOLD       61     10                     6             60
linked ***            -19           -66           WILLIAM     ARNOLD      26     WILLIAM    ARNOLD       35     10                     6             60
linked ***            -19           -66           FRANKLIN    ARNOLD      17     FRANKLIN   ARNOLD       27     10                     6             60
linked                -13           -64           MARY        BUSSARD     50     MARY       BUZZARD      61     10                     12            120
linked                -13           -64           OZILLA      BUSSARD     25     ROZILLA    BUZZARD      36     10                     12            120
linked                -13           -64           WILLIAM     BUSSARD     19     WILLIAM    BUZZARD      28     10                     12            120
                      -25           -17           GEORGE      CHRITTEN    16     GEO        CRITTEN      26     17                     3             51
                      -25           -17           WILLIAM     CHRITTEN    43     WILLIAM    CRITTEN      55     17                     3             51
                      0             0             SARAH       TURK        39     SARAH      TURK         48     13                     2             26
                      0             0             EVALINE     TURK        4      EVALIENE   TURK         13     13                     2             26
                      -10           2             JOSEPH      KIME        35     JOSEPH     KIME         48     17                     2             34
                      -10           2             SUSAN       KIME        31     SUSAN      KIME         39     17                     2             34
linked ***            2             14            D           DEFENBAUGH  37     DAVID      DEFFENBAUGH  46     19                     36            684
linked ***            2             14            ISABELL     DEFENBAUGH  37     ISABELLA   DEFFENBAUGH  48     19                     36            684
linked ***            2             14            GEORGIANNA  DEFENBAUGH  9      GEORGANNA  DEFFENBAUGH  19     19                     36            684
linked ***            -7            28            SAMUEL      THOMSON     52     SML        THOMPSON     62     17                     4             68
linked ***            -7            28            HARIET      THOMSON     47     HARRIET    THOMPSON     58     17                     4             68
linked ***            -7            28            EDGAR       THOMSON     5      EDGAR      THOMPSON     15     17                     4             68
linked ***            -29           43            CALEB       MATHIS      46     CALEB      MATHIS       56     22                     18            396
linked ***            -29           43            SOFLENA     MATHIS      43     SOPLENA    MATHIS       53     22                     18            396
linked ***            -29           43            SOFLENA     MATHIS      9      SOPLENA    MATHIS       19     22                     18            396
linked ***            -29           43            WILLIAM     MATHIS      7      WILLIAM    MATHIS       16     22                     18            396
linked ***            -29           43            HELLAND     MATHIS      2      HOLLAND    MATHIS       12     22                     18            396
linked                -47           44            SAERTIS     SMITH       51     LAERTES    SMITH        61     27                     8             216
linked                -47           44            LOUISA      SMITH       48     LOUISA     SMITH        59     27                     8             216
linked                -38           47            ANDREW      WRIGHT      55     ANDREW     WRIGHT       65     26                     14            364
linked                -38           47            EMELINE     WRIGHT      44     EMMELINE   WRIGHT       54     26                     14            364
linked                -41           53            WILLIAM     BOATMAN     52     WILLIAM    BOATMAN      62     26                     7             182
linked                -41           53            ELENOR      BOATMAN     50     ELEANOR    BOATMAN      60     26                     7             182
linked                -40           75            CHARLES     THRASHER    55     CHARLES    THRASHER     65     26                     40            1040
linked                -40           75            MARY        THRASHER    43     MARY       THRASHER     52     26                     40            1040
linked                -40           75            THANKFUL    THRASHER    6      THANKFUL   THRASHER     16     26                     40            1040
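In every row of Figure 8 the combo score equals the neighbor count (PHHN) multiplied by the unique score (for example, 19 × 36 = 684 for the Defenbaugh household). The sketch below checks that relationship on a few rows copied from the figure. Treating the combo score as a simple product is an inference from these rows, not a definition taken from the text, and the `GridRow` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GridRow:
    surname: str
    neighbor_count: int  # PHHN column
    unique_score: int
    combo_score: int     # value printed in Figure 8


def combo(neighbor_count: int, unique_score: int) -> int:
    """Assumed relationship: combo score = neighbor count x unique score."""
    return neighbor_count * unique_score


# One row per household, copied from Figure 8.
rows: List[GridRow] = [
    GridRow("WOODRUFF", 10, 5, 50),
    GridRow("ARNOLD", 10, 6, 60),
    GridRow("BUSSARD", 10, 12, 120),
    GridRow("DEFENBAUGH", 19, 36, 684),
    GridRow("THRASHER", 26, 40, 1040),
]

# Households whose printed combo score differs from the product.
mismatches = [r.surname for r in rows
              if combo(r.neighbor_count, r.unique_score) != r.combo_score]

# Ranking by combo score puts well-supported, distinctive households first.
ranked = sorted(rows, key=lambda r: r.combo_score, reverse=True)
```

A multiplicative score means a household needs both many shared neighbors and a distinctive name profile to rank highly; a large value on only one of the two is not enough.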
Figure 9. Linked Household Examples, 1870-1880 Complete-Count
linked80  name1_70   name2_70  name1_80     name2_80  relate80   age70  age80  sex70   sex80   bpl70       bpl80
linked    WN         AYERS     W.N.         AYERS     head       45     54     male    male    Ohio        Ohio
linked    SARAH      AYERS     SARAHANN     AYERS     spouse     41     51     female  female  Vermont     Vermont
unlinked                       JOHN         AYERS     child             24             male                Washington
unlinked                       WALTER       AYERS     child             22             male                Washington
unlinked                       HOWARD       AYERS     child             19             male                Illinois
linked    IDA        AYERS     IDA          AYERS     child      6      16     female  female  Illinois    Illinois
unlinked                       CARRIE       AYERS     child             14             female              Iowa
unlinked                       WILLIE       AYERS     child             8              male                Arkansas
          JOHN       AYERS                                       14            male            missing
          WALTER     AYERS                                       12            male            missing
          HOWARD     AYERS                                       9             male            Iowa
          CORA       AYERS                                       4             female          Iowa

linked80  name1_70   name2_70  name1_80     name2_80  relate80   age70  age80  sex70   sex80   bpl70       bpl80
unlinked                       HENRYC.      CUTTING   head              46             male                Ohio
unlinked                       CORDELIA     CUTTING   spouse            43             female              Vermont
linked    LUCYA      CUTTING   LUCY         CUTTING   child      10     20     female  female  Ohio        Ohio
linked    WILLIAMH   CUTTING   WILLIAMK.    CUTTING   child      7      19     male    male    Ohio        Ohio
unlinked                       ANALISCIA    CUTTING   child             17             female              Ohio
linked    SAMUELJ    CUTTING   SAMUELJ.     CUTTING   child      4      14     male    male    Ohio        Ohio
linked    CORAA      CUTTING   CORAA.       CUTTING   child      1      11     female  female  Ohio        Ohio
unlinked                       MINNIEI.     CUTTING   child             9              female              Ohio
unlinked                       JOHN         PETERMAN  unrelated         21             male                Ohio
          HENRY      CUTTING                                     28            male            Ohio
          CORDELIA   CUTTING                                     25            female          Vermont
          ANNE       CUTTING                                     5             female          Ohio

linked80  name1_70   name2_70  name1_80     name2_80  relate80   age70  age80  sex70   sex80   bpl70       bpl80
linked    NATHAN     MILLER    NATHAN       MILLER    head       54     63     male    male    Ohio        Ohio
linked    MARGARETD  MILLER    MARGARETD.   MILLER    spouse     53     63     female  female  Ohio        Ohio
linked    CHARLESH   MILLER    CHARLEN.     MILLER    child      10     20     male    male    Ohio        Ohio
linked    SARAHJ     MILLER    SARAHM.      MILLER    child      13     23     female  female  Ohio        Ohio
unlinked                       ELWOODC.     MILLER    child             17             female              Ohio
unlinked                       LOUIZAJ.     MILLER    child             29             female              Ohio
linked    JOHNW      MILLER    JOHNW.       MILLER    child      2      12     male    male    Ohio        Ohio
unlinked                       JOSEPHINER.  MILLER    child             10             female              Ohio
          ELWOODC    MILLER                                      7             male            Ohio
          MINERVA    MILLER                                      21            female          Ohio
          ROSETTAJ   MILLER                                      0             female          Ohio
Figure 10. Linked Household Examples After Forced Linking Process, 1870-1880 Complete-Count
linked80  name1_70   name2_70  name1_80     name2_80  relate80   age70  age80  sex70   sex80   bpl70       bpl80
explicit  WN         AYERS     W.N.         AYERS     head       45     54     male    male    Ohio        Ohio
explicit  SARAH      AYERS     SARAHANN     AYERS     spouse     41     51     female  female  Vermont     Vermont
explicit  IDA        AYERS     IDA          AYERS     child      6      16     female  female  Illinois    Illinois
forced    CORA       AYERS     CARRIE       AYERS     child      4      14     female  female  Iowa        Iowa
forced    JOHN       AYERS     JOHN         AYERS     child      14     24     male    male    missing     Washington
forced    WALTER     AYERS     WALTER       AYERS     child      12     22     male    male    missing     Washington
forced    HOWARD     AYERS     HOWARD       AYERS     child      9      19     male    male    Illinois    Washington
unlinked                       WILLIE       AYERS     child             8              male                Arkansas

linked80  name1_70   name2_70  name1_80     name2_80  relate80   age70  age80  sex70   sex80   bpl70       bpl80
forced    HENRY      CUTTING   HENRYC.      CUTTING   head       28     46     male    male    Ohio        Ohio
forced    CORDELIA   CUTTING   CORDELIA     CUTTING   spouse     25     43     female  female  Vermont     Vermont
explicit  LUCYA      CUTTING   LUCY         CUTTING   child      10     20     female  female  Ohio        Ohio
explicit  WILLIAMH   CUTTING   WILLIAMK.    CUTTING   child      7      19     male    male    Ohio        Ohio
forced    ANNE       CUTTING   ANALISCIA    CUTTING   child      5      17     female  female  Ohio        Ohio
explicit  SAMUELJ    CUTTING   SAMUELJ.     CUTTING   child      4      14     male    male    Ohio        Ohio
explicit  CORAA      CUTTING   CORAA.       CUTTING   child      1      11     female  female  Ohio        Ohio
unlinked                       MINNIEI.     CUTTING   child             9              female              Ohio
unlinked                       JOHN         PETERMAN  unrelated         21             male                Ohio

linked80  name1_70   name2_70  name1_80     name2_80  relate80   age70  age80  sex70   sex80   bpl70       bpl80
explicit  NATHAN     MILLER    NATHAN       MILLER    head       54     63     male    male    Ohio        Ohio
explicit  MARGARETD  MILLER    MARGARETD.   MILLER    spouse     53     63     female  female  Ohio        Ohio
explicit  SARAHJ     MILLER    SARAHM.      MILLER    child      13     23     female  female  Ohio        Ohio
explicit  CHARLESH   MILLER    CHARLEN.     MILLER    child      10     20     male    male    Ohio        Ohio
forced    ELWOODC    MILLER    ELWOODC.     MILLER    child      7      17     male    female  Ohio        Ohio
forced    MINERVA    MILLER    LOUIZAJ.     MILLER    child      21     29     female  female  Ohio        Ohio
explicit  JOHNW      MILLER    JOHNW.       MILLER    child      2      12     male    male    Ohio        Ohio
forced    ROSETTAJ   MILLER    JOSEPHINER.  MILLER    child      0      10     female  female  Ohio        Ohio
Figure 11. Linked Household Examples, 1870-1880 Complete-Count
name1_70   name2_70  name1_80   name2_80  relate80  age70  age80  sex70   sex80   bpl70       bpl80
WILLIAM    FENTON    WM.H.      FENTON    head      35     46     male    male    New Jersey  New Jersey
CORDELIA   FENTON    CORDELIA   FENTON    spouse    33     44     female  female  DC          DC
                     JOHNW.     FENTON    child            21             male                DC
SAMUEL     FENTON    SAMUEL     FENTON    child     9      19     male    male    DC          DC
EMMA       FENTON    EMMA       FENTON    child     7      17     female  female  DC          DC
WILLIAM    FENTON    WILLIAM    FENTON    child     5      15     male    male    DC          DC
MARY       FENTON    MAY        FENTON    child     3      13     female  female  DC          DC
BESSIE     FENTON    BESSIE     FENTON    child     1      10     female  female  DC          DC
WALKER     FENTON                                   11            male            Virginia
IDA        WALKER                                   16            female          DC
LOUISA     BROWN                                    27            female          Maryland

name1_70   name2_70  name1_80   name2_80  relate80  age70  age80  sex70   sex80   bpl70       bpl80
WILLIAMJ   CANTRELL  W.J.       CANTRELL  head      56     67     male    male    Georgia     Georgia
AMANDA     CANTRELL  AMANDA     CANTRELL  spouse    43     54     female  female  Georgia     Georgia
                     FELIX      CANTRELL  child            23             male                Georgia
                     FESTUS     CANTRELL  child            23             male                Georgia
MARGARETA  CANTRELL  MAGGIE     CANTRELL  child     11     20     female  female  Georgia     Georgia
JOHN       CANTRELL  JOHN       CANTRELL  child     5      13     male    male    Georgia     Georgia
                     EVA        CANTRELL  child            8              female              Georgia
JAMESR     CANTRELL                                 17            male            Georgia
MARGARETF  CANTRELL                                 15            female          Georgia
ABDAF      CANTRELL                                 13            male            Georgia
ABBAF      CANTRELL                                 13            male            Georgia
SUSAN      CANTRELL                                 38            female          Virginia
CHARLES    CANTRELL                                 7             male            Georgia
ARMSTEAD   CANTRELL                                 1             male            Georgia
Figure 12. Linking Older Sons, Livingston County, Illinois, 1870-1880 Complete Count
fname70  lname70  age70  fname80  lname80  age80  serial80  serial80diff
CALEB    MATHIS   46     CALEB    MATHIS   56     *799      0
SOPLENA  MATHIS   43     SOFLENA  MATHIS   53
SOPLENA  MATHIS   9      SOFLENA  MATHIS   19
WILLIAM  MATHIS   7      WILLIAM  MATHIS   16
HOLLAND  MATHIS   2      HELLAND  MATHIS   12
GEORGE   MATHIS   19
JAMES    MATHIS   17
ELBERT   MATHIS   13
EUGENE   MATHIS   12

fname80  lname80  age80  serial80  serial80diff
GEORGE   MATHIS   29     *805      6
SARAH    MATHIS   27
MAY      MATHIS   4
LENA     MATHIS   2
CARL     MATHIS   1

fname80  lname80  age80  serial80  serial80diff
JAMES    MATHIS   27     *819      20
ANNA     MATHIS   25
NELIE    MATHIS   2
Table 1. Linked households (top) and individuals (bottom), St. Louis 1880
A. By households (HH)
                         1st Enumeration                   2nd Enumeration
Number of Related in HH  N HHs    N Linked HHs  Linked %   N HHs    N Linked HHs  Linked %
1                        3,524    505           14.3       3,855    481           12.5
2                        10,650   6,578         61.8       11,599   6,221         53.6
3                        11,043   8,482         76.8       11,502   8,354         72.6
4                        10,721   9,113         85.0       11,039   9,002         81.5
5                        9,546    8,453         88.6       9,729    8,371         86.0
6                        7,193    6,521         90.7       7,500    6,668         88.9
7                        4,835    4,491         92.9       5,046    4,593         91.0
8                        2,988    2,804         93.8       3,194    2,968         92.9
9                        1,620    1,530         94.4       1,673    1,557         93.1
10+                      1,205    1,155         95.9       1,361    1,274         93.6
All                      63,325   49,632        78.4       66,498   49,489        74.4

B. By individuals
                         1st Enumeration                    2nd Enumeration
Number of Related in HH  N Individuals  N Linked  Linked %  N Individuals  N Linked  Linked %
1                        3,519          505       14.4      3,814          481       12.6
2                        21,300         12,588    59.1      23,198         11,897    51.3
3                        33,129         23,801    71.8      34,506         23,108    67.0
4                        42,884         34,146    79.6      44,156         33,247    75.3
5                        47,730         39,854    83.5      48,645         38,908    80.0
6                        43,134         36,763    85.2      45,000         36,967    82.1
7                        33,831         29,599    87.5      35,322         29,790    84.3
8                        23,888         20,981    87.8      25,552         21,785    85.3
9                        14,580         12,799    87.8      15,057         12,867    85.5
10+                      12,688         11,147    87.9      14,487         11,896    82.1
All                      276,683        222,183   80.3      289,737        220,946   76.3
Note: Related refers to household members related to the household head, either biologically or through marriage.
Table 2a. Linked population’s distribution by surname similarity measures, St. Louis 1880
                    N        Dist. (%)  NYSIIS  Double Meta  Match1  Match2  Match3
Less than 0.6       2,751    1.2        0.6     3.3          13.9    0.1     0.0
0.60 to 0.649       2,604    1.2        3.1     7.1          44.0    4.1     0.0
0.65 to 0.699       3,573    1.6        7.4     16.4         59.1    8.4     0.0
0.70 to 0.749       6,910    3.1        10.6    18.1         68.3    20.1    1.4
0.75 to 0.799       9,506    4.3        19.5    29.5         79.7    33.9    7.1
0.80 to 0.849       15,918   7.2        32.2    40.7         86.1    41.6    18.4
0.85 to 0.899       20,644   9.3        39.1    47.0         90.2    64.9    34.0
0.90 to 0.949       27,128   12.2       49.1    57.5         96.4    83.2    68.0
0.95 to 0.999       25,348   11.4       66.3    74.7         99.5    93.9    86.5
1.00 (Exact match)  108,048  48.6       100.0   100.0        100.0   100.0   100.0
All                 222,430  100.0      69.4    73.6         93.4    80.7    71.5
Table 2b. Distribution by Jaro-Winkler score for given names, St. Louis 1880
                    N        Dist. (%)  N Name Std.  % Name Std. (by row)
Less than 0.6       13,092   5.9        3,080        23.5
0.60 to 0.649       3,607    1.6        971          26.9
0.65 to 0.699       4,538    2.0        1,695        37.3
0.70 to 0.749       6,491    2.9        2,136        32.9
0.75 to 0.799       8,407    3.8        3,545        42.2
0.80 to 0.849       13,813   6.2        9,415        68.2
0.85 to 0.899       14,407   6.5        9,983        69.3
0.90 to 0.949       19,464   8.8        14,418       74.1
0.95 to 0.999       13,063   5.9        10,461       80.0
1.00 (Exact match)  119,595  53.8       0            0.0
Initial Match       5,953    2.7        0            0.0
All                 222,430  100.0      55,704       25.0
Table 3. Distribution of age, sex, race, birthplace precision, St. Louis 1880
A. Age difference
                        N        Dist. (%)
−2 (and greater) years  13,283   6.0
−1 year                 17,552   7.9
Same age                106,275  47.8
+1 year                 61,686   27.7
+2 (and greater) years  23,634   10.6
Total                   222,430  100.0

B. Sex
                        N        Dist. (%)
Agrees                  220,323  99.1
Disagrees               2,107    0.9
Total                   222,430  100.0

C. Race
                        N        Dist. (%)
Agrees                  221,904  99.8
Disagrees               526      0.2
Total                   222,430  100.0

D. Own birthplace
                        N        Dist. (%)
Agrees                  203,785  91.6
Disagrees               18,645   8.4
Total                   222,430  100.0

E. Father’s birthplace
                        N        Dist. (%)
Agrees                  182,620  82.1
Disagrees               39,810   17.9
Total                   222,430  100.0

F. Mother’s birthplace
                        N        Dist. (%)
Agrees                  180,917  81.3
Disagrees               41,513   18.7
Total                   222,430  100.0
Table 4. Migration Status for Rules-Based Household Links, 1870-1880 Complete-Count
Surname Similarity              N Linked HHs  Non Migrant (1)  Same State, Different County (2)  Different State (3)  Migrant (2+3)
.90 to .909                     44,568        75.8             14.3                              9.9                  24.2
.91 to .919                     25,779        75.2             14.7                              10.1                 24.8
.92 to .929                     38,949        76.1             14.1                              9.9                  23.9
.93 to .939                     44,984        76.7             13.8                              9.5                  23.3
.94 to .949                     40,668        77.0             13.5                              9.5                  23.0
.95 to .959                     38,060        77.1             13.3                              9.6                  22.9
.96 to .969                     69,252        76.9             13.5                              9.6                  23.1
.97 to .979                     70,961        77.5             13.0                              9.5                  22.5
.98 to .999                     7,016         77.6             13.2                              9.3                  22.4
exact match                     1,173,183     79.0             12.0                              9.0                  21.0
All                             1,553,420     78.5             12.4                              9.1                  21.5

Household Uniqueness Score      N Linked HHs  Non Migrant (1)  Same State, Different County (2)  Different State (3)  Migrant (2+3)
<10                             821,342       77.5             12.9                              9.5                  22.5
10-19                           340,343       78.8             12.2                              9.0                  21.2
20-29                           170,603       79.5             11.7                              8.8                  20.5
30-39                           100,776       79.9             11.6                              8.5                  20.1
40-49                           57,573        80.7             11.0                              8.3                  19.3
50-59                           31,557        81.2             11.0                              7.8                  18.8
60+                             31,226        81.7             10.5                              7.8                  18.3
All                             1,553,420     78.5             12.4                              9.1                  21.5

N potential Links in Household  N Linked HHs  Non Migrant (1)  Same State, Different County (2)  Different State (3)  Migrant (2+3)
3                               561,326       76.8             13.0                              10.2                 23.2
4                               541,916       78.3             12.6                              9.1                  21.7
5                               272,968       80.0             11.7                              8.3                  20.0
6+                              177,210       81.9             10.8                              7.3                  18.1
All                             1,553,420     78.5             12.4                              9.1                  21.5
Table 5a. Household Linkage Rate, 1870-1880 Complete-Count (all 1880 households)
Number rules based linked households     1,553,420
Number of explicitly linked Individuals  6,473,809

N Linkable in 1880 Household  N 1880 Households  N 1880 Households Linked  % Linked
1                             934,251            0                         0.0
2                             4,354,712          0                         0.0
3                             1,788,843          375,655                   20.9
4                             1,280,185          428,408                   33.4
5                             832,111            354,198                   42.5
6+                            889,856            395,159                   44.3
All                           10,079,958         1,553,420                 15.4
Table 5b. Household Linkage Rate, 1870-1880 Complete-Count (1880 households with 3 or more linkable records only)
Race and Nativity (Household Head)  N 1880 Households  N 1880 Households Linked  % Linked
Native-born white                   2,918,696          1,133,828                 38.7
Foreign-born white                  1,339,201          337,725                   25.2
Black                               456,746            67,722                    14.8
Mulatto                             71,053             13,872                    19.5
Other                               5,299              273                       5.2
All                                 4,790,995          1,553,420                 32.4
Table 6. Distribution of Neighbor Count (PHHN) For All Potential Household Links (0.9 Surname Threshold) and For Rules-Based Household Links, 1870-1880 Complete-Count
                            All Potential HH Links (0.9 Surname Threshold)    Rules-Based Household Links Only
N Neighbors in Grid (PHHN)  N Potential HH Links          Distribution        N Linked Households (Rules-based)  Distribution  % Non Migrant (linked households)
1                           12,727,140                    59.3                317,330                            20.4          28.5
2                           4,155,353                     19.4                124,729                            8.0           56.6
3                           1,459,194                     6.8                 67,350                             4.3           71.8
4                           546,663                       2.5                 43,488                             2.8           80.6
5                           238,992                       1.1                 32,522                             2.1           86.6
6                           129,656                       0.6                 27,108                             1.7           90.6
7                           88,112                        0.4                 24,699                             1.6           92.9
8                           71,170                        0.3                 23,441                             1.5           94.4
9                           65,238                        0.3                 23,917                             1.5           95.2
10                          62,179                        0.3                 24,166                             1.6           96.0
11                          61,185                        0.3                 24,453                             1.6           96.4
12                          61,229                        0.3                 25,431                             1.6           96.5
13                          61,355                        0.3                 25,556                             1.6           96.5
14                          61,341                        0.3                 25,788                             1.7           96.9
15                          61,751                        0.3                 25,992                             1.7           97.0
16                          62,226                        0.3                 26,479                             1.7           97.2
17                          62,844                        0.3                 26,998                             1.7           97.5
18                          62,307                        0.3                 26,757                             1.7           97.7
19                          62,998                        0.3                 27,326                             1.8           97.5
20+                         1,345,209                     6.3                 609,890                            39.3          98.8
All                         21,446,142                    100.0               1,553,420                          100.0         78.5
Table 7. Number of Linked Households and Individuals, Rules-Only and Rules Plus PHHN Grids
Link Type                 N Potential Links in Household  N Households Linked  N Linked Individuals  % Non Migrant (By Households)
rules only (0.9 surname)  3+                              1,553,420            6,473,809             78.5
rules plus (0.9 surname)  2                               485,800              982,388               97.5
rules plus (0.9 surname)  3                               87,326               266,460               97.6
rules plus (0.9 surname)  4+                              36,400               211,712               96.5
rules only (0.8 surname)  3+                              144,469              879,008               76.9
rules plus (0.8 surname)  2                               20,418               62,193                96.9
rules plus (0.8 surname)  3                               65,009               130,914               97.4
rules plus (0.8 surname)  4+                              8,902                50,911                94.9
All                                                       2,401,744            9,057,395
Table 8. Migration Status for Rules-Based Household Links, 1870-1880 Complete-Count (Surname 0.8 only)
Surname Similarity              N Linked HHs  Non Migrant (1)  Same State, Different County (2)  Different State (3)  Migrant (2+3)
.80 to .809                     10,870        75.6             14.7                              9.7                  24.4
.81 to .819                     6,908         76.9             13.7                              9.5                  23.1
.82 to .829                     16,513        75.9             14.1                              10.0                 24.1
.83 to .839                     9,119         76.9             13.6                              9.5                  23.1
.84 to .849                     14,697        76.2             14.1                              9.7                  23.8
.85 to .859                     15,230        76.7             13.7                              9.6                  23.3
.86 to .869                     20,393        77.9             12.7                              9.3                  22.1
.87 to .879                     10,765        76.8             13.5                              9.7                  23.2
.88 to .889                     19,506        77.5             13.2                              9.3                  22.5
.89 to .899                     20,468        77.8             12.8                              9.5                  22.1
All                             144,469       76.9             13.5                              9.6                  23.1

Household Uniqueness Score      N Linked HHs  Non Migrant (1)  Same State, Different County (2)  Different State (3)  Migrant (2+3)
<10                             65,579        76.4             14.1                              9.5                  23.6
10-19                           28,190        77.5             13.1                              9.4                  22.5
20-29                           21,886        77.2             12.9                              10.0                 22.8
30-39                           12,690        76.8             13.5                              9.7                  23.2
40-49                           7,500         77.5             12.5                              10.1                 22.5
50-59                           4,126         78.3             12.4                              9.3                  21.7
60+                             4,498         81.7             10.5                              7.8                  22.1
All                             144,469       76.9             13.5                              9.6                  23.1

N potential Links in Household  N Linked HHs  Non Migrant (1)  Same State, Different County (2)  Different State (3)  Migrant (2+3)
3                               20,871        76.1             13.3                              10.6                 23.9
4                               71,348        75.5             14.5                              10.0                 24.5
5                               32,911        78.3             12.6                              9.1                  21.7
6+                              19,339        80.1             11.8                              7.3                  19.1
All                             144,469       76.9             13.5                              9.6                  23.1
Table 9a. Household Linkage Rate, 1870-1880 Complete-Count, Rules and Rules Plus PHHN Grids
Number rules based linked households     2,401,744
Number of explicitly linked Individuals  9,057,395

N Linkable in 1880 Household  N 1880 Households  N 1880 Households Linked  % Linked
1                             934,251            0                         0.0
2                             4,354,712          305,461                   7.2
3                             1,788,843          585,676                   33.6
4                             1,280,185          572,697                   45.8
5                             832,111            442,524                   54.5
6+                            889,856            495,387                   57.0
All                           10,079,958         2,401,744                 23.9
Table 9b. Household Linkage Rate, 1870-1880 Complete-Count, Rules and Rules Plus PHHN Grids (1880 households with 2 or more linkable records only)
Race and Nativity (Household Head)  N 1880 Households  N 1880 Households Linked  % Linked
Native-born white                   5,729,709          1,743,796                 30.8
Foreign-born white                  2,307,905          510,277                   22.3
Black                               943,820            123,500                   13.2
Mulatto                             151,489            23,622                    15.8
Other                               12,784             549                       4.3
All                                 9,145,707          2,401,744                 26.3
Table 10a. Linked population’s distribution by surname similarity measures
                    N           Dist. (%)  NYSIIS  Double Meta  Match1  Match2  Match3
0.80 to 0.849       467,651     4.6        29.7    35.9         82.4    46.8    21.3
0.85 to 0.899       648,388     6.4        43.8    50.1         90.6    65.7    32.7
0.90 to 0.949       1,127,925   11.1       54.4    61.7         96.8    84.0    68.9
0.95 to 0.999       1,076,914   10.6       71.6    74.0         99.7    94.8    86.7
1.00 (Exact match)  6,920,409   67.4       100.0   100.0        100.0   100.0   100.0
Total               10,241,287  100.0      85.1    86.9         98.2    93.1    87.3
Table 10b. Distribution by Jaro-Winkler score for given names
                    N           Dist. (%)  N Name Std.  % Name Std. (by row)
Less than 0.6       554,168     5.4        151,288      27.3
0.60 to 0.649       64,808      0.6        6,092        9.4
0.65 to 0.699       115,222     1.1        37,332       32.4
0.70 to 0.749       162,728     1.6        25,386       15.6
0.75 to 0.799       281,654     2.8        92,664       32.9
0.80 to 0.849       274,735     2.7        81,871       29.8
0.85 to 0.899       419,735     4.1        218,682      52.1
0.90 to 0.949       783,995     7.7        61,936       7.9
0.95 to 0.999       500,938     4.9        40,576       8.1
1.00 (Exact match)  7,083,299   69.0       0            0.0
Total               10,241,287  100        715,827      7.0
Table 11. Distribution of age, birthplace, sex and race precision
A. Age difference
                        N           Dist. (%)
−5 (and greater) years  191,952     1.9
−4 years                134,355     1.3
−3 years                240,630     2.3
−2 years                564,267     5.5
−1 year                 1,957,802   19.1
Same age                4,940,063   48.2
+1 year                 1,337,226   13.1
+2 years                402,158     3.9
+3 years                178,233     1.7
+4 years                105,844     1.0
+5 (and greater) years  188,757     1.8
Total                   10,241,287  100.0

B. Birthplace agreement
                        N           Dist. (%)
Agrees                  9,983,344   97.5
Disagrees               257,943     2.5
Total                   10,241,287  100.0

C. Sex agreement
                        N           Dist. (%)
Agrees                  10,200,689  99.6
Disagrees               40,598      0.4
Total                   10,241,287  100.0

D. Race agreement
                        N           Dist. (%)
Agrees                  10,179,031  99.4
Disagrees               62,256      0.6
Total                   10,241,287  100.0