capstone project - u.s. census bureau, center for adaptive...
TRANSCRIPT
CAD11-30-15
CapstoneProject-U.S.CensusBureau,CenterforAdaptiveDesignTitleoftheProject:CombiningTraditionalandNewDataSourcesintheProductionofOfficialStatisticsTiming:Spring2016Client:TheCenterforAdaptiveDesignattheU.S.CensusBureauIssueDefinition:TheU.S.CensusBureauisthelargeststatisticalagencyintheU.S.Government.Inadditiontotheonce-a-decadepopulationcount,theCensusBureaucollectsdataontheU.S.economy,healthcare,poverty,education,crime,andbusinessinnovation,amongothers.ThepredominantmethodusedforcollectingdatabytheCensusBureauisthetraditionalsamplesurveyadministeredtohouseholds,businessesandgovernmententities.Responseratestosamplesurveys,however,aredecliningrapidlywhilesurveycostsareclimbingatanunsustainablerate.Further,theCensusBureauishighlyinterestedinreducingpublicburden(theeffortandtimespentrespondingtosurveys).ThiscombinationofchallengesisdrivingtheCensusBureautoexplorealternativedatacollectionmethods,includingpassiveapproachesusingBigDatatechniques.Theoverallissuethisprojectwillhelpaddressishowtocombinethedatafromnewsourceswithdatacollectedusingthetraditionalapproaches.Thegoalistoincreaseefficiencyandtimeliness,reducecostandburden,andmaintaindataquality.ScopeofWork:Aparticularlineofsurveyquestionsthathaveproventobeproblematicarethosesurroundingthetopicofpeople’stravelandcommutinghabits.Withtheavailabilityoftrafficandcommutingdatafrompublicandprivatesources,aswellascrowd-sourcedapplications(suchasWaze)theCensusBureaushouldbeabletoaccessdatathatmayenableareductionintravelandcommutingquestionsinoursamplesurveys(suchastheAmericanCommunitySurvey).ThisprojectwouldresearchpossibledatasourcestoinferselectedtravelandcommutingstatisticsintheAmericanCommunitySurvey,suggestmethodsandtacticsforaccessingandinterpretingthosesources,andexploreapproachesforverifyingthequalityofthesesourceswiththegoalovertimeofincorporatingthemintoofficialstatistics.Possibledatasourcesincluderoadwaytrafficcounts,traveltime,parking,publictransportation,etc.Workalsoinvolvesdevisingmethodstotestnewdatasourcequalityagainstestablishedsurveydatabenchmarks,aswellassuggestingaviablelong-termplanfortransitioningtonewdatasources.ExpectedDeliverables:
• Researchandrecommendationsonpossiblenewdatasourcesto
complement/replacetargetedquestionsonexistingsamplesurveyssuchastheAmericanCommunitySurvey
CAD11-30-15
• Literaturereviewofexistingworkonincorporatingnewdatasourcesintotravelsurvey
• Documentationofmethods,technology,cost-benefits,andtacticsforaccessingandinterpretingnewdatasources
• Suggestedapproachesforverifyingthequalityofnewdatasources• Documentationofapossiblelong-termplanfortransitioningtonewdatasources
SkillsRequired:DataAnalytics/DataScienceSurveyMethodologyInformationTechnology/SystemDevelopmentAdvisoryBoard:MichaelThieme,ChiefoftheCenterforAdaptiveDesign,U.S.CensusBureauBenjaminReist,AssistantCenterChief,U.S.CensusBureau,CenterforAdaptiveDesign,U.S.CensusBureauStephanieCoffey,MathematicalStatistician,CenterforAdaptiveDesign,U.S.CensusBureauDeadline:12/5/2015FacultyAdvisor:SeanQianProposercontactinfo:Name,email,phone#Clientpointofcontact:[email protected](301)763-9062
CapstoneReport
Spring2016
Advisor
SeanQian
Team
ManikandanPalani|NishaRao|SahilAggarwal|YifeiJiang|YunFu
P a g e 1|52
TableofContents
ExecutiveSummary......................................................................................................................................3
Introduction.................................................................................................................................................3
Background..............................................................................................................................................3
Goal..........................................................................................................................................................4
ProjectMethodology...................................................................................................................................4
Task1:ACSquestion“Howdidthispersonusuallygettoworklastweek?”...........................................5
Task2:ACSquestion“Howmanyminutesdiditusuallytakethispersontogetfromhometoworklastweek?”...................................................................................................................................................10
LessonsLearned...........................................................................................................................................7
SuggestionsforFutureWork........................................................................................................................8
Acknowledgement.......................................................................................................................................8
AppendixI:LiteratureReview......................................................................................................................8
NationalHighwayTravelSurveyDataAnalysis........................................................................................8
FactorsaffectingtraveltimeinUS...........................................................................................................9
BusTransitTime.....................................................................................................................................10
INRIXTrafficData...................................................................................................................................11
UrbanCongestionLevel.........................................................................................................................11
AppendixII:InformationonExternalDataSources...................................................................................13
NHTS:NationalHouseholdtravelsurvey...............................................................................................13
APCAVL-Automaticpassengercounterandautomaticvehiclelocationdata.....................................14
PittsburghParkingTerminals...................................................................................................................1
INRIX.........................................................................................................................................................2
AppendixIII:DataCleaningSteps.................................................................................................................4
APCAVL:Automaticpassengercounterandautomaticvehiclelocationdata........................................4
PittsburghDowntownparkingterminals.................................................................................................4
INRIX.........................................................................................................................................................5
AppendixIV:Models....................................................................................................................................6
LogisticRegression...................................................................................................................................6
MultinomialLogisticRegression...............................................................................................................6
ArithmeticMean......................................................................................................................................6
SimpleLinearRegression.........................................................................................................................6
P a g e 2|52
AppendixV:Code.........................................................................................................................................7
Codeusedtopredictmodeoftransport:Pittsburgh_Mode_Choice.R....................................................7
CodeusedtogettheFIPScodeofstopfromtheCensusAPI:getStopFIPS.py......................................13
Codeusedtofiltermeantraveltimefromtractleveltocountylevel:filter.py....................................14
Codeusedtocalculatethemeantime:calculateMeanTT.py................................................................15
Codeusedtopredictthetravel-time:travel_time.R.............................................................................17
References..................................................................................................................................................18
P a g e 3|52
ExecutiveSummary
This report is oneof the deliverables associatedwith theHeinz CapstoneProject programat Carnegie
MellonUniversity.Inthisprogram,studentsaregroupedinteamsoffiveandassignedaclientforwhom
theywillundertakeaninitiativeoverthecourseofasemester.
OurclientforthisprojectistheCentreforAdaptiveDesignoftheUnitedStatesCensusBureau.TheU.S
CensusBureauistheprincipleU.SFederalStatisticalsystemintheU.S.
OurgoalwastohelpreducethecostofconductingtheAmericanCommunitySurvey.Thisreportisdivided
intoseveralsectionsthatdetailthestepstakenbytheteamtodevelopasolutionthathelpsachievethe
goal.Firstly,itprovidesabriefbackgroundontheteam’objective.Secondly,itdetailsthemodelbuiltby
CMU Capstone team to estimate response to two of the fivework-commute related questions in the
American Community Survey (ACS) using historical ACS data and data from external sources – INRIX,
PittsburghParking.Theteamcomparedtheaccuracyoftheresultsofthemodelusinga20%,40%,60%
and80%samplingplan toobserve theutilityof themodel in scenarioswhereonly20%or40%of the
existingrespondentsweretobecontacted.ThemodelcouldhelptheBureautoreducethenumberof
peoplesurveyedyetobtainreliablyaccurateresponsesfortherestoftheintendedtargetpopulationand
subsequentlyreducesurveycosts.Thirdly,thereportdetailsthefutureworkofincludingfactorssuchas
congestionandweatherthatcouldhelpfurtherrefinethemodel.
Introduction
Background
US Census Bureau conducts the American Community Survey (ACS) on an ongoing basis to collect
information about the American people and their community. This data is used by planners and
entrepreneurstoassessthepastandplanthefuture.Italsohelpsdeterminethedistributionofmorethan
400$billioninfederalandstatefundseachyear.
Thesurveyisadministeredthroughseveralwaystooneinthirty-eightU.Shouseholdsperyear.Theform
can be completed online or on a paper form.U.S Census Bureau has a follow-up procedure in case a
respondentdoesnotrespondwithinastipulatedtimeframe.Thenon-respondenteitherreceivesacallor
apersonalvisitfromtheCensusstafftohelpcompletethesurvey.
Thecostofconductingthesurveyisincreasingwhilethesurveyresponserateisdeclining.ACScurrently
has seventy-seven questions. To reduce respondent burden, Centre for Adaptive Design of US Census
Bureauisresearchingwaystoreducesurveycost.
Ofitsseventy-sevenquestions,ACShasfivequestionsthatcaptureinformationaboutwork-commuteof
therespondents.Thesequestionshaveprovedtobeveryproblematicwhenitcomestoelicitingresponses.
P a g e 4|52
Goal
UndertheguidanceofSeanQian,CMUProfessor,ourgoalofthisprojectwastoaccesswaystoinferthe
travelandcommuterelatedquestionsinACS.
ProjectMethodology
Ofthefivework-commuterelatedquestionsinACS,theteamconcentratedontwoquestions-“Howdid
thispersonusuallygettoworklastweek?”and“Howmanyminutesdiditusuallytakethispersontoget
from home to work last week?”. The objective was to build models to estimate responses to these
questions.
So,theprojectmethodologyhasbeendetailedwithrespecttothetwotasks.
Task1:ACSquestion-“Howdidthispersonusuallygettoworklastweek?”
Summary: The team through its initial research (Appendix I: NHTS Data Analysis) identified the
methodologytodevelopamodeltopredictthemodeoftransportaspublicorprivate.Theexplanatory
variablesforthemodelwereinitiallyidentifiedfrombothNHTSandACSdatasources.
NHTSdataisdetailedandcananswermanyquestionsdirectlybutitslatestdataisfromyear2009.
NHTSisnotconductedasfrequentlyasACS.Duetothisfact,theutilityofthemodelwaslimited.More
detailsareprovidedin"StepI:PredictivemodelusingNationalHouseholdTravelSurveydata"
TheteamthenidentifiedequivalentofNHTSvariablesinACSdata.Theteammembersdeveloped
amodeltopredictthemodeoftransportaspublicorprivateusingsolelythevariablesfromACSdata.They
refinedthemodelfurthertobeabletoidentifythetypeofpublicorprivatemodeoftransporttakenbya
respondent.Theutilityofthismodelwastestedusing20%,40%,60%and80%randomsamplingplan.More
detailsareprovidedin"StepII:PredictivemodelusingPUMSdataofAmericanCommunitySurvey"
Task2:ACSquestion-“Howmanyminutesdiditusuallytakethispersontogetfromhometoworklast
week?”
Summary:UsingtheCommutinginAmericaIIIreport(AppendixII:FactorsaffectingtraveltimeinUS),the
team identified intuitively the variables that could affect the work-commute travel time. The team
membersalsoresearched(AppendixIII:BusTransitTimeandAppendixIV:INRIXTrafficData)onhowto
makeuseoftheexternaldatasourcessuchasBustransitdataandINRIXtrafficspeeddatainthemodelto
predicttravel-time.
The team faced challengeswith respect to data granularity and overcame themby using data
cleaningmethodsandaggregatingthedataattheappropriatelevel.Theteamalsohadtomakequiteafew
assumptionsinordertomakeuseoftheexternaldatasets.Themembersdevelopedamethodtocalculate
meantraveltimeforbustransitatblocklevel.Moredetailsareprovidedin"StepI:Meantraveltimefor
Publictransportation-Businagivenblock".
Thedetailsof the travel-timepredictingmodel areprovided in "Step II: Predictivemodelusing
historicaldatafromACSalongwithinputsfromexternaldataset".
P a g e 5|52
Theteamalsodidsomeresearchoncalculatingthecongestionlevelinurbanareas(AppendixV:
Urban Congestion Level). As per the “Commuting in America III report”, travel time is an attribute of
commuters whereas congestion is an attribute of facilities. Therefore, measures of travel times and
measuresofcongestiondonotnecessarilyconverge.Hencethecongestionfactorhasnotbeendelvedinto
detailandcurrentlynotbeenincorporatedinthetravel-timemodel.
Thedetailsofeachstepineachtaskaredescribedbelow.
Task1:ACSquestion“Howdidthispersonusuallygettoworklastweek?”
ResponseoptionsinACS:
• Car,truck,orvan
• Busortrolleybus
• Streetcarortrolleycar
• Subwayorelevated
• Railroad
• Ferryboat
• Taxicab
• Motorcycle
• Bicycle
• Walked
• Workedathome
• Othermethod
StepI:PredictivemodelusingNationalHouseholdTravelSurveydata
Summary:Thismodelwasbuilttoinitiallyclassifythepopulationintermsofwhetheranindividualuseda
publicoraprivatemodeoftransporttocommutetowork,andoncethatmodelshowedpromise,todig
deeperintothepredictionbybeingabletoalsoidentifywhichparticularmodeoftransportwasused(for
e.g.car,bicycleetc.inprivatevehiclesandbus,trainetc.inpublicmodesoftransport).
Tool:RStudio
InputData:
Source:NationalHouseholdTravelSurvey
Details:
Year Level #ofvariables #ofrecords
2001 National 142 70k
2009 National 150 75k
Data Cleaning Step: Some of the input variables were non-binary categorical variables. The logistic
regression model cannot accept non-binary categorical variables. These variables were converted to
relevantdummyvariablesforuseinthepredictiveprocess.Fore.g.Thevariablefor‘Race’cantakeupto9
differentvalues.Soatleast8dummyvariableswereneededforeachvaluethatthevariablecouldassume.
Aftertakingthedummyvariablesintoaccount,wehadatotalofaround160variables.Thevariablesthat
didnotrealisticallyholdanypromisewereweededouteitherduetotheir inherentdefinitionordueto
their infrequency in the actual data. After removing the variables that had almost zero frequency of
occurrence,theshortlistedvariablestotaledinthe60-80range.
P a g e 6|52
ExplanatoryVariables:
Ø Age
Ø Sex
Ø Race
Ø Region
Ø Urban housing
status
Ø Block housing
status
Ø MSAsize
Ø Educationlevel
Ø Hispanicindicatorpopulation
density
Ø Totalincome
Ø Householdincome
Ø Renterpercentage
Ø Workerstatus
Ø Numberofvehicles
Ø Homeownership
status
Ø Placeofbirth
Selectionlogic:DetailsareprovidedinAppendixI:NHTSDataAnalysis
PredictedVariable:Modeoftransport–PublicorPrivate
Model:GLMbasedlogisticclassifier
ln #$1 − #$
= )* + ),-,,$ + )/-/,$ + ⋯+)1-1,$
whereln #$1 − #$
= logoddsofprivatemodeoftransportused
-,,$ =predictorvariablei
Code: “Details in Appendix V under section “Code used to predict mode of transport:
Pittsburgh_Mode_Choice.R”
Output:
Ø 93%predictionaccuracyfor2001NHTSdataat0.5cutoffasadecisionrule
Ø 95%predictionaccuracyfor2009NHTSdataat0.5cutoffasadecisionrule
Validation:10-foldcrossvalidationmethodwasusedtovalidatetheresults.Ithadanaverageaccuracyrate
of93%.
Challenge:
Ø Highlyskewednatureofdata-92%peopleuseprivatemodeoftransport
Ø Trade-offbetweenaccuracyandthenumberoffalsepositivesfortweakingthedecisionruleand
increasingitbeyond0.5
Theoverall rateofpublic transportuse in theNHTSsamplewasaround7%whichwasroughlyour
model’smisclassificationrate.Thishappenedduetotheinherentlyskewednatureofthedata,where~93%
ofthepopulationusedaprivateformoftransporttocommutetoworkwhereasonly7%usedapublic
modeoftransport.Furthermore,ourmodel’sROCcurveshowed26%greaterrecallthanaflipping-of-coin
basedmodel.
Conclusion: The most important and most correlated variables (with the mode of transport used to
commutetowork)areAge,Sex,EducationlevelandTotalIncome-allofwhichmadeperfectintuitivesense.
P a g e 7|52
TheNational Household Travel Survey data can be used to estimate cannot be used as an alternative
source.
StepII:PredictivemodelusingPUMSdataofAmericanCommunitySurvey
Summaryofthemodel:Thismodelwasbuilttoclassifythepopulationintermsoftheparticularmodeof
transport(fore.g.car,bicycle,train,etc.)usedtocommutetowork.
Tool:RStudio,AmazonWebService
InputData:AmericanCommunitySurvey’shistoricaldata
TheAmericanCommunitySurvey (ACS)PublicUseMicrodataSample (PUMS) filesarea setof
untabulatedrecordsaboutindividualpeopleorhousingunits.TheCensusBureauproducesthe
PUMSfilessothatdatauserscancreatecustomtablesthatarenotavailablethroughpretabulated
(orsummary)ACSdataproducts.
DataCleaningStep:
Ø TheteamextractedthePittsburghCitydatafromthewholenationalwidedatasetbasedonthe
PUMAcodeofPittsburghCity(01701and01702).
Ø Forthemissingvalues,wereplacedthemwith0or01accordingtocertaincircumstances.
ExplanatoryVariables:
Thereare512variablesinthedataset,theteamthenselectedthevariablesthatmostinfluence
whichtransportmodethepersonismostlikelytouseforbothPittsburghCityandU.S.national
wide.
VariablesforPittsburghCity:
Name Explanation
JWMNP Traveltimetowork
FMRGIP Firstmortgagepaymentincludesfire,hazard,floodinsuranceallocationflag
MIG Mobilitystatus
PINCP Totalperson'sincome
COW Classofworker
JWRIP Vehicleoccupancy
INTP Interest,dividends,andnetrentalincomepast12months
FLANXP LanguageotherthanEnglishallocationflag
PWGTP Person'sweight
SPORDER PersonNumber
PERNP Totalperson'searnings
HINS2 Insurancepurchaseddirectlyfromaninsurancecompany
P a g e 8|52
TYPE Typeofunit
FS Yearlyfoodstamprecipiency
RACASN Asianrecode(Asianaloneorincombinationwithoneormoreotherraces)
RNTM Mealsincludedinrent
JWTR(Target) Meansoftransportationtowork
VariablesforU.S.NationalWide:
Name Explanation
JWMNP Traveltimetowork
POWSP Placeofwork-Stateorforeigncountryrecode
HINS7 IndianHealthService
Wgtp HousingWeightreplicate3
REFR Refrigerator
RACBLK BlackorAfricanAmericanrecode(Blackaloneorincombinationwithoneormoreother
races)
MHP Mobilehomecosts(yearlyamount)
SSP SocialSecurityincomepast12months
PUMA Publicusemicrodataareacode(PUMA)
OCCP Occupationrecode
POWPUMA PlaceofworkPUMA
FINDP Industryallocationflag
SMOCP Selectedmonthlyownercosts
HINCP Householdincome(past12months)
OCPIP Selectedmonthlyownercostsasapercentageofhouseholdincomeduringthepast12
months
JWTR Meansoftransportationtowork
SelectionLogic:Sameastheonementionedintheprevioussection.
PredictedVariable:Modeoftransport–Car,Bus,Streetcar,Subway,Railroad,Ferryboat,Taxicab,
Motorcycle,Bicycle,Walked,Workathomeorothermethod
Model:
MultinomialLogisticRegressionisthelinearregressionanalysistoconductwhenthedependentvariableis
nominal with more than two levels. Thus it is an extension of logistic regression, which analyzes
dichotomous(binary)dependents.
P a g e 9|52
Also,inordertopreventtheoverfittingproblem,weconducted10-foldercrossvalidation.Cross-validation
isamodelvalidationtechniqueforassessinghowtheresultsofastatisticalanalysiswillgeneralizetoan
independentdataset.Itismainlyusedinsettingswherethegoalisprediction,andonewantstoestimate
howaccuratelyapredictivemodelwillperforminpractice.Inapredictionproblem,amodelisusuallygiven
adatasetofknowndataonwhichtrainingisrun(trainingdataset),andadatasetofunknowndata(orfirst
seendata)againstwhichthemodelistested(testingdataset).Thegoalofcrossvalidationistodefinea
datasetto"test"themodelinthetrainingphase(i.e.,thevalidationdataset),inordertolimitproblems,
likeoverfitting,giveaninsightonhowthemodelwillgeneralizetoanindependentdataset.
Code:
DetailsinAppendixVundersection“Codeusedtopredictmodeoftransport:
Pittsburgh_Mode_Choice.R”
Output:
PittsburghCity
U.S.NationalWide
Validation:
Duetothehighlyskewedoriginaldataset(overall95%ofthehouseholdsusecartocommute),theteam
hastorebalancethedataset.Wespecificallyextractedhalfofthedataascommuterswhousecarandthe
resthalfascommuterswhodon'tusecar.
ForPittsburghCity,theteamthenrandomlysubsetthedatasetintosamplesof20%,40%,60%and80%
and ran simulation 1000 times to compare the accuracy. However, for U.S. National wide, even if we
deployedAWSandlaunchedthelargestinstance(x10.Large)onit,westillcouldn'trunthewholenational
dataset.Soalternatively,werandomlysubsetthedatasetintosamplesof10%,20%,30%,40%and50%,
andransimulationonly1timestocomparetheaccuracy.
P a g e 10|52
Challenge:Largevolumeofdata
Conclusion:
ForPittsburghCitydata,theaverageaccuracyafter1000timessimulationrunsshowsthatpartial
samplingcanleadtoroughlylesseraccuracy.However,forU.S.NationalWidedata,wecannot
seeaclearpattern.Webelievethatthereasonbehindofthisphenomenonisthelackofsimulation
times.Giventhebudgetandtimerestriction,werealizedthatthereisalimitofcomputerpower.
Task2:ACSquestion“Howmanyminutesdiditusuallytakethispersontogetfromhome
toworklastweek?”
StepI:MeantraveltimeforPublictransportation-Businagivenblock
Summary:Usethearithmeticmeanformulatocalculatethemeantraveltime
Tool:ExcelandPython
Input:APCAVL(Automaticpassengercounterandautomaticvehiclelocation)data
Year Level Location
September2013 City Pittsburgh
DataCleaningStep:Eliminatemissingvaluesbyfilteringoutthoserecordsfromtheinput.
Inputvariables:
Ø PeoplegettingonthebusatastopØ PeoplegettingoffthebusatastopØ CurrentloadinthebusØ StopIDØ BlockIDorFIPScodeØ Timetakenbetweenthestops
Ø Dwelltimes
SelectionLogic:Majorvariableswereuseddirectlyorindirectlyforcalculatingmeantraveltime.Therewas
novariableselectionmethodsused.Thisisdirectcalculation.
PredictedVariable:meantraveltimeperpersoninagiventract(FIPScode)
Model:Formulaforcalculatingthemeantraveltimeisgivenbelow.
P a g e 11|52
MTT:Meantraveltimeofpeoplewhousepublictransportinaparticularblock
n:Numberofpeoplewhogetonthebus
t:Differenceintimetakenbetweengettingonthebusandgettingofthebus.
Code:Pythoncodefiles:CodeisinAppendix.
Ø getStopFIPS.py:thisistogettheFIPScodeofstopfromtheCensusAPI
Ø filter.py:Filteredmeantraveltimefromtractleveltocountylevel
Ø calculateMeanTT.py:Thisistocalculatethemeantime
Assumptions:Ø Locationatwhichapassengerboardsisassumedtobethatpassenger’shome/worklocationin
thatparticulartract.
Ø Thismaynecessarilynotbetrue.Itisimportanttorealizethatapersoncangetinandgetdownat
randomstopsaccording tohiswish.Though therearepatterns in thedata, there isnoway to
determinethesourceorthedestinationofapassengerfromthisdata.Thisaffectstheaverage
traveltimecritically.Thereisnodatatoidentifythepersoninquestionorforwhatpurposethe
persontravelsinthepublictransport.Meantraveltimecannotbecomputedatanindividuallevel.
Ø FIFOassumptionvisualized
Figure8section2
Output:
Meantraveltimevisualized
P a g e 12|52
Figure9Section2
Figure10Section2–ThevisualizationofmeantravelforbustransportatblocklevelforPittsburgh
P a g e 13|52
Challenge:
Ø ThebustransitdataonlyhasstopIDnumbersandstopnamewithnocoordinates.
Ø FindingthecoordinatesforthestopIDwasadifficulttask.TheteamsearchedseveralAPIs
includinggooglemapsandfinallyfoundthevaluesinPittsburghcityGISdata.
Ø JoinedthePittsburghcityGISdataandthebustransitdatausingstopIDasprimarykey
andobtainedthecoordinatesforthestop.
Ø Once the coordinateswereobtained, the coordinateswereused to get the FIPS code,
whichidentifiestheblocktowhichthestopbelongs.
Ø AllthebusstopswerethenassignedtotheirrespectiveblockusingFIPScode.
Ø ThestopsweregroupedaccordingtoFIPScode.
Ø Oncethisgroupingwasdone,averagetraveltimeforeachFIPScodewascalculatedusing
thepythoncode:calculateMeanTT.py
Conclusion:
Though,wemakeseveralassumptionstherelativemeantraveltimevaluesfordifferentblockscanstillgive
usagoodideaaboutthetrendsintraveltimeacrossblocks.
Forinstance,Meantraveltimeisstopsindowntownandquitelongforstopsinruralareas.Belowfigure
shows,thevisualizationRedindicatesmoretime.Blueindicateshighertime.
P a g e 1|52
StepII:PredictivemodelusinghistoricaldatafromACSalongwithinputsfromexternaldataset
Summary:Theteamtriedtoanswerthesequestionsbybuildingalinearregressionmodelbycombining
variablesfromACSdataandtheExternalDataSources.Themainmotivewastofindtheeffectofadding
ExternalDataSourcevariableontheACStraveltimeforalltractswithinPittsburgh.
Tool:RStudio,WEKA
Input:
DataSource TimePeriod Level Location #ofrecords #ofvariables
ACS 5year
estimates
CensusTract Pittsburgh 22 33
Pittsburgh
ParkingTerminals
2013and2014 City Pittsburgh 1
APCAVL 2013and2014 City Pittsburgh 1
DataCleaningStep:
Inputvariable:DetailsaboutcalculationofMeantraveltimeforpublictransportationismentionedinthe
previoussection.
Variablename Description(Allvariablesareintractlevel)
TRACTCE10 TRACTID
PublicMTT Pittsburghpublictransportmeantraveltimecalculatedfromexternal
dataset
ParkingTransactions NumberofparkingtransactioninPittsburgh
PopulationTotal Totalpopulation
IncomePerCapita Percapitaincome
MeansofTransportTotal Totalnumberofvehicles
Cartruckvan Numberofcars,trucksorvan
DroveAlone Numberofpeoplewhodrovealone
Carpooled NumberofPeoplewhocarpooled
Publictransport NumberofPeoplewhousedpublictransport
Bicycle NumberofPeoplewhousedbicycle
Walked Numberofpeoplewhowaked
Othermeans Numberofpeoplewhousedothermodesoftransport
Workedathome Numberofpeoplewhoworkedfromhome
Lessthan5minutes Numberofpeoplewhotooklessthan5minutestotraveltowork
5to9minutes Numberofpeoplewhotookbetween5to9minutestotraveltowork
10to14minutes Numberofpeoplewhotookbetween10to14minutestotravelto
work
15to19minutes Numberofpeoplewhotookbetween15to19minutestotravelto
work
P a g e 2|52
20to24minutes Numberofpeoplewhotookbetween20to24minutestotravelto
work
25to29minutes Numberofpeoplewhotookbetween25to29minutestotravelto
work
30to34minutes Numberofpeoplewhotookbetween30to34minutestotravelto
work
35to39minutes Numberofpeoplewhotookbetween35to39minutestotravelto
work
40to44minutes Numberofpeoplewhotookbetween40to44minutestotravelto
work
45to59minutes Numberofpeoplewhotookbetween45to49minutestotravelto
work
60to89minutes Numberofpeoplewhotookbetween60to89minutestotravelto
work
90ormoreminutes Numberofpeoplewhotookbetween90minutestotraveltowork
White Raceidentifier
AfricanAmerican Raceidentifier
Asian Raceidentifier
Other Raceidentifier
Male Numberofmales
Female Numberoffemales
AvgTT Averagetraveltime
Selection Logic: This is detailedunder the section “Factors affecting travel time inU.S” in “Appendix I:
LiteratureReview”.
Predictedvariable:Traveltimeattractlevel
Model:Linearregression
Code: Lm (formula = X$Avg_TT ~X$Public_MTT + X$Parking_Transactions + X$Population_Total +
X$Means_of_Transport_Total,data=X)
Output:
Belowistheresultoftheregressionmodelwithoutexcludingthehighlycorrelatedvariables.
P a g e 1|52
Figure11Section2
ItcanbeclearlyseenthatRsquareishigh.RsquareexplainsthevariabilityinthemodelandaHighRsquare
value(closeto1)impliestherearemanyvariablesthatareexplainingthevariationinthemodel.Pvalue
which denotes the significance is also lesser than 0.05. Therefore, we decided to exclude the highly
correlatedvariablesandruntheregressionmodelonlywiththeessentialvariables.Thetablebelowshows
thehighlycorrelatedvariablesthatwereexcluded.
P a g e 1|52
Figure12Section2
OutputofRegressionrunsafterexclusion:
Figure13Section2
GoodnessofFit:
Consideranylinearregressionmodel,whichlookslikethefollowing
Are all the assumptions - Normality of residuals or errors from the model, constant residual variance
throughouttherange–true?Itismostlyunlikely.Itisonlyplausiblethattheassumptionsarecloseenough.
Goodnessoffitdefineshowcloselytheassumptionsforthemodelareusefulinpractice.
Thereareseveralmeasuresforgoodnessoffittheyare:
• Examiningresidues
• AglobalmeasureofvarianceexplainedbyR2
• AglobalmeasureofvarianceadjustedfornumberofparametersinthemodelcalledadjustedR2
Residualscanbeuseddescriptively,bylookingateitherhistogramsorscatterplotsofresiduals.Consider
thefollowingmodel-
TheBCDresidualfortheBCDobservationisgivenby
WhereYiistheobserveddependentvariableandXijistheobservedcovariatesfortheBCDobservation.Nowletusplotthegraphandseetheoutcomes.
P a g e 2|52
Fit = lm(X$Avg_TT~
X$Public_MTT+X$Parking_Transactions+X$Population_Total+X$Means_of_Transport_Total)
>hist(fit$residuals,main="Histogramforresiduals",xlab="residuals")
>plot(X$Population_Total,fit$residuals)
>plot(X$Public_MTT,fit$residuals)
>plot(X$Parking_Transactions,fit$residuals)
>plot(X$Means_of_Transport_Total,fit$residuals)
P a g e 3|52
P a g e 4|52
Thehistogramisnotstronglynormalinshape,but,thenagain,therearejust22Observations.Scatterplots
lookreasonable,i.e.,noobviousdeparturesfromConstantvarianceorlinearity.
R2:Aglobalmeasureof“varianceexplained”
R2valueisrepeatedlyusedinregressiontoexplainthefit.Wewillnowseeexactlyhowtointerpretthis
measure.Considerfirstsimplelinearregression. Weareusually interestedinwhethertheindependent
variableisworthhaving,soreallywearecomparingthemodel
Y=α+βXtothesimplermodelY=α,usingtheinterceptonly.
Wedefine:
R2=1–{(sumofsquaredresidualsfrommodelwithαandβ/(sumofsquaredresidualsfrommodelwithαonly)}
Or
R2=1–{SS(res)/SS(total)}
whereSS(res)(oftenalsoreferredtoasSSEor“sumofsquaresoferror”)isdefinedasthesumofresidual
distancesfrommodelwithX,andSS(total)isthesame,butfromtheinterceptonlymodel.
Bydefinitionofleastsquaresregression,SS(res)≤SS(total),becauseifthebestregressionlinewasreally
usingαonly,thenSS(res)=SS(total),andinallothercases,addingβimprovesSS(res).
So,0≤R2
≤1.IfSS(res)=SS(total),thenR2
=0,andmodelisnotuseful.IfSS(res)=zero,thenR2=one,
andmodelfitsallpointsperfectly.Almostallmodelswillbebetweentheseextremes.
P a g e 5|52
Therefore,SS(res)showshowmuchcloserthepointsgettothelinewhenβisused,comparedtoaflatline
usingαonly(whichisalwaysY=α=Y).
Becauseofthis,wecancallR2the“proportionofvarianceexplainedbyaddingthevariableX”.
Essentiallythesamethinghappenswhenthereismorethanoneindependentvariable,exceptresidualsare
fromthemodelwithallXvariablesforthenumeratorinthedefinitionofR2.
Thus,R2
givesthe“proportion
ofvarianceexplainedbyaddingthevariablesX1,X2...Xp,iftherearepindependentvariablesinthemodel.
HowlargedoesR2
needtobetobeconsideredas“good”?Thisdependsonthecontext;thereisnoabsolute
answerhere.ForhardtopredictYvariables,smallervaluesmaybe“good”.Overall,R2providesauseful
measureofhowwellamodelfits,intermsof(squared)distancefrompointstothebestfittingline.However,
asoneaddsmoreregressioncoefficients,R2nevergoesdown,eveniftheadditionalXvariableisnotuseful.
Inotherwords,thereisnoadjustmentforthenumberofparametersinthemodel.
AdjustedR2
Inasimplelinearregression,pisthenumberofindependentvariables.
Itp=1thenAdjustedR2
=R2
.Asthenumberofparametersincreases,AdjustedR2
≤R2
.
Withthisdefinition:
R2=1–{(n−1)×sumofsquaredresidualsfrommodelwithαandβ/(n−p)×sumofsquaredresiduals
frommodelwithαonly}
Therefore, there issomeattempttoadjust for thenumberofparameters.Letusseethis forourmodel
below.
Call:
lm(formula=X$Avg_TT~X$Public_MTT+X$Parking_Transactions+
X$Population_Total+X$Means_of_Transport_Total)
Residuals:
Min1QMedian3QMax
-1.80948-0.614350.077580.522862.60368
Coefficients:
EstimateStd.ErrortvaluePr(>|t|)
(Intercept)9.301e+007.183e-0112.9503.11e-10***
X$Public_MTT7.227e-012.826e-012.5570.0204*
X$Parking_Transactions-4.102e-051.912e-05-2.1450.0467*
X$Population_Total-4.064e-033.614e-04-11.2452.70e-09***
X$Means_of_Transport_Total7.996e-036.955e-0411.4961.93e-09***
P a g e 6|52
---
Signif.codes:0‘***’0.001‘**’0.01‘*’0.05‘.’0.1‘’1
Residualstandarderror:1.155on17degreesoffreedom
MultipleR-squared:0.9006, AdjustedR-squared:0.8772
F-statistic:38.49on4and17DF,p-value:2.607e-08
Rreport:R2
=0.9006andAdjustedR2
=0.8772,so,ineithercase,about90%ofthetotalvarianceisexplained
bythevariablesused,whichisveryhigh.Atleastbythesemeasures,themodelfitswell.
Challenge:
Weinitiallycollectedthetractleveldataforthe33selectedvariablesandcombinedthemintoonefile.138
tractsfallunderAllegany-Pittsburghborders.
Oncethisdatawascollated,weusedtheTRACTIDandTRACTnameaskeytocombinetheExternalData
SourcevariablestotheACSdataset.
ParkingdataandAPCAVLdataarecollectedbydifferentauthorities.Theydonothaveacommonground
forcombining.Whenthedatasetsarebroughttoacommonlevelforexample,attractlevelitresultsin
missingdata.
ThemainreasonforthisisparkingdataisnotavailableforalltractsinPittsburghi.e.parkingspotsarenot
spreadacrossallareasinatractinPittsburgh.Similarly,buslinestravelacrossparticularroutesanddata
forsometractsarenotavailable.Whencombinedthisnaturallyresultsindatanotbeingavailableforsome
tracts.
Oncethedatasetswerecombineditresultedindataonlyfor22outof138Pittsburghtracts.
Thissamplesizeof22isextremelysmalltorunaregressionmodel.However,takingintoaccountthatthis
isonlyaproofofconceptwebuiltaregressionmodelusingR.
The regressionmodelwas initially runwith all the intuitively selected variables forACS alongwith the
externaldatasetvariables.
However,thismodelhadlargenumberofvariablesthatwerehighlycorrelatedandeffectsoftheexternal
datasetvariablescouldnotberealizedclearly.
Soweexcludedthehighlycorrelatedvariablesandranaregressionmodelonlywith4variables.2from
external data set. Publicmean travel time and number of parking transactions per tract and the ACS
variables.Totaltractpopulationandnumberofvehicles inatract.Theresultsarediscussedinthenext
page.
Conclusion:
Ø ThefirstRsquarevalueintheregressionresultsafterexcludingthesignificantvariables
is very high. Itmeans the existing variables in themodel explain the variability in the
P a g e 7|52
externaldatasetverywell.Thep-valueisalsolessthan0.05anditmeansthevariables
aresignificant.
Ø Modeoftransportforacommutercanbepredictedwithjustafractionofthecurrent
sampleavailabletoagooddegreeofaccuracy.
Ø Meantraveltimeforpeopleusingbusaspublictransportationhasbeencalculatedfor
blocklevel.Othercalculations,involvingdatafromexternaldatasource,includeparking
transactionsforall tractswithinPittsburghandaveragespeeddataandaveragetravel
time.
Ø UsingexternalandexistingACSdatameantravel timeforPittsburghtract level iswell
predicted. External data sets are not uniform. So, data needs to be cleaned and
normalized.Normalizationmayresultinaverysmallsample.Asaresult,thiscanonlybe
used as a proof of concept, further research is encouraged and results need to be
scrutinized.
Conclusion
Ø Mode of transport for a commuter can be predictedwell, evenwith a fraction of the
currentsample.
Ø MeantraveltimeatPittsburghtractleveliswellpredictedusingacombinationofAPCAVL
andPAparkingwiththeACSdata(87%ofthevariabilityexplained)
LessonsLearned
Data
Realworlddataisextremelydifferentandmultifaceted.Datasetsarecomplexaswellaslargein
volume. Useful information must be extracted from raw data. Even after extracting useful
informationmissing values and normalization issues can occur. Large data sets further create
problems when it comes to computing. They require huge processing power and distributed
computing.Laptopsandminicomputerscannothandlesuchdatahencetheyneedtobemoved
tothecloud.Thisresultsinextracosts.Sobetterandmoreefficientcodeneedstobedeveloped.
Assumptions
Whilebuildingmodelsusingdatamining severalassumptions likeFIFO,Queueareused these
assumptionscangivemodelsthatareverygoodforcreatingproofofconceptsbutnotentirely
accuratemodels.Forbuilding,moreaccuratemodels,oneneedstodeveloporusecustomized
machinelearningalgorithmsbasedonthedataset.
P a g e 8|52
SuggestionsforFutureWork
Travelmodemodel can be expanded to national level using the nationwide datawithmore
computingpoweranddistributedcomputing.
Expandlinearregressionmodeltopredicttraveltimefor
Ø AlltractsofPittsburgh.
Ø Statewideandnationwideusingotherexternaldatasets.
Travel mode model can be further refined by including the congestion and weather as
independentvariables.
Acknowledgement
ThiscapstoneprojectwasundertakenbytheteamforCentreforAdaptiveDesign,USCensusBureau,with
support fromCarnegieMellonUniversity.We thankour advisor Professor SeanQian for his continued
supportandguidance.Wealsothankourclientfortheirpatienceandsupport.
Thisprojectallowedsomeofustohoneouranalyticalskillswhileothersgainedanunderstandingofthe
analytics field.We thank our Heinz college, CarnegieMellon University, for providing us this learning
opportunity.
AppendixI:LiteratureReview
NationalHighwayTravelSurveyDataAnalysis
Source: The Impacts of Socio-Economic and Demographic Shifts in Transit Served Neighborhoods OnModeChoiceAndEquitybyStevenApell
P a g e 9|52
Highlighttheimpactsofsocio-economicanddemographicchangesonTOD.TODstandsforTransit-oriented
development. It is amixed-use residential and commercial areadesigned tomaximize access topublic
transport,andoften incorporates features toencourage transit ridership.Thepaperalsoevaluates the
overalleffectivenessofTODpolicyinsupportingasustainablecommunity.
Answeredfollowingfourquestions:
Ø Howhasthesocio-economicanddemographiccharacterofcommunitieschangedinblockgroups,
whicharewithin0.5mileand0.5-1.0mileofTODandnon-TODstations?
Ø IsthepresenceofTODinterrelatedwiththeoccurrenceofgentrificationinblockgroupswithin0.5
and0.5-1.0mileoftransitstations?
Ø Howhavechangesinsocio-economicanddemographicfactorsinfluencedmodechoiceinblock
groupswithin0.5and0.5-1.0mileradiusofTODandnon-TODstations?
Ø Whatarethedifferencesbetweensocio-demographiccharacteristicsofcommunitieslivinginTOD
comparedwith communities innon-TOD; further,whatdifferencesexistbetweencommunities
within0.5milesofTODcomparedwithcommunitieslivingwithin0.5-1.0mileradius?
Helpedidentifythefollowingfactorsthataffectthemodeoftransportchosenbyacommutertotraveltowork:
Ø Percentageofhouseholdswithnocarormorethanonecar
Ø Medianandpercapitaincome
Ø Percentageofcollegegraduates
Ø Medianhousingvalue
Ø Mediancontractrent
Ø Percentageofownerandrenteroccupiedhousing
Ø Racialcategories(PercentagesforBlackandWhite)
Ø Percentageofcollegegraduates
Ø EmploymentType
Ø LengthofResidency
FactorsaffectingtraveltimeinUS
Source:CommutinginAmericaReportIII-TheThirdNationalReportonCommutingPatternsandTrendsbyAlanE.Pisarski
Thisreportexaminescommutingpatternsintermsoflongstandingtrendsandemergingfactorsthataffect
commutingeveryday.Itexaminesthedatatrendidentifiedthroughtheresponsestoeachtravelquestion
ofACSindetail.
Traveltimeisafunctionofbothspeedanddistance.Theincreaseinworktripdistanceswhenmatchedby
the increase inworktravelspeeds indicatesanactual improvement inworktravelspeeds.Travel times
increasingfasterthanthetraveldistancesmightindicatetheeffectofcongestioninthetraveltime.Travel
distances growing faster than travel speeds might indicate that the commuter prefers staying in a
neighborhoodwithreducedhousingcostandthisneighborhoodmightbeataconsiderabledistancefrom
hisworklocation.
P a g e 10|52
Ithasalsobeenobservedinpastthatwomenmakefrequentstopswhiletravellingandassuch,theirtravel
timewillbemore.Areassmaller in sizewill tend tosendat least someportionofcommuters toother
metropolitanareasforwork.Familieswhohavetoddlerstravellessandalot.
Ithelpedustointuitivelyidentifythefactorsthatcouldaffecttravel-timeofacommuter.Thevariablesidentifiedwereasfollows–
Ø Age
Ø Gender
Ø MaritalStatus
Ø Presenceandageofownchildren
Ø Gavebirthtochildwithinpast12months
Ø Frequentresidenceshifters
Ø Citizenshipstatus-categorical
Ø Education
Ø Income
Ø Numberofvehiclesinthehousehold
Ø Monthlyrent
Ø Numberofbreadearnersinhousehold
BusTransitTime
Source:ImpactofTrafficCongestiononBusTravelTimeinNorthernNewJerseybyClaireE.McKnight,HerbertS.Levinson,KaanOzbay,CamilleKamga,andRobertE.Paaswell
Thispaperdetails the impactofcongestiononthebustravel timerate. Italsoprovidesdetailsabouta
regressionmodeldevelopedtoestimatetraveltimerate(inminutespermile)ofabusasafunctionofcar
traffictimerate,numberofpassengershoardingpermile,andthenumberofbusstopspermile.
Itdetailsthemethodologyinthefollowingway-
Ø Giveabriefoverviewofthepreviousstudiesconducted
Ø Describethedatacollectionandinitialanalysisofdata
Ø Definethemodel
Ø Statetheconclusions
PredictBustraveltimeforaparticularroute,particularbusandparticulartimeslot.
Aroutesegmentisdefinedasasectionofroutebetweentwoadjacenttimepoints,withatimepoint(TP)
beingthelocationatwhichtheschedulehadarecordedtime.
Busspeedvariationbetweenstopscanindicatecongestion.A.mpeak–7amto10am;Midday–10am
to4pm;P.mpeak–4pmto7pm;Postp.mpeak–after7pm.
Factors Characteristics
Bus Ø #ofstopsØ StopspacingØ Dwelltimesatstops
P a g e 11|52
Ø #ofpassengersboardingØ #ofpassengersalighting
Route Ø SegmentlengthØ #oftrafficsignalsØ #ofleftturnsinroute
Traffic Ø Cartraveltimerate=trafficlevelØ #ofvehiclesinqueuewaitingtomakeleftturnØ #oftaxismakingsuddenstopsorturnstopickup/dropoffpassengersØ #ofvehiclesontheroad
Parking Ø #ofdoubleortripleparkedcars
INRIXTrafficData
Source:INRIXInterfaceGuideDecember2014
ThisreportprovidesdetailedinformationabouttheINRIXdata.Italsoactsasaguidetohelpunderstand
howtomakeuseofINRIXdata.
Itcontainsthefollowinginformation-
Ø InterpretingTMCcodes
Ø UnderstandingXDsegments,sub-segmentsandINRIX-managedsetfiles
Ø Integratinggraphicaltrafficdatainuserapplications
Ø Generatingspeedbuckets
Thedatadictionary included in this reporthelpedus tounderstand the variables in the inputdata file
containingtheINRIXtrafficspeeddata.
UrbanCongestionLevel
Source:2015UrbanMobilityScorecard
It is published jointly by the Texas A&M Transportation Institute and INRIX. This report details the
methodologyusedtoestimatethecongestionlevelatanurbanarea-widelevel.Thisallowsforcomparison
ofcongestionlevelinaconsistentwayacrossurbanareas.
Itusesacombinationoftrafficvolumedataalongwiththetrafficspeeddatatocomputethecongestion
levelmeasures.Allthemeasuresandmanyoftheinputvariablesforeachyearandeverycityareprovided
inaspreadsheetthatcanbedownloadedathttp://mobility.tamu.edu/ums/congestion-data/.Theroadway
inventorydatasourceformostofthecalculationsistheHighwayPerformanceMonitoringSystemfromthe
FederalHighwayAdministration.TrafficvolumedataissourcedfromINRIX.
Thefollowingstepswereusedtocalculatethecongestionperformancemeasuresforeachurbanroadwaysection.
Ø ObtainHPMStrafficvolumedatabyroadsection
Ø MatchtheHPMSroadnetworksectionswiththeINRIXtrafficspeeddatasetroadsections
P a g e 12|52
Ø Estimatetrafficvolumesforeachhourtimeintervalfromthedailyvolumedata
Ø Calculateaveragetravelspeedandtotaldelayforeachhourinterval
Ø Establishfree-flow(i.e.,lowvolume)travelspeed
Ø Calculatecongestionperformancemeasures
Ø Additionalstepswhenvolumedatahadnospeeddatamatch
Themobilitymeasuresrequirefourdatainputs:
Ø Actualtravelspeed
Ø Free-flowtravelspeed
Ø Vehiclevolume
Ø Vehicleoccupancy(personspervehicle)tocalculateperson-hoursoftraveldelay
This paper could be referred in future to calculate the congestion factor to be included as another
independentvariableinthetravel-timepredictingmodel.
P a g e 13|52
AppendixII:InformationonExternalDataSources
NHTS:NationalHouseholdtravelsurvey
Data:NHTS2001and2009
NHTS isatravelsurveythatcollects informationaboutpeople,especiallyan individual’stravelbehavior
over a period through questionnaire. These surveys collect information about demographics of the
individual. Especially information about socio-economic status, household (size and structure), vehicle
data,travelmode,vehiclesowned,vehiclesused,purposeofthejourney,startingpointandendingpoint
ofthejourneyandnumberofpeoplewhothepersontravelledwith.
TheNationalHouseholdTravelSurvey(NHTS)providesinformationtoassisttransportationplannersand
policymakerswhoneedcomprehensivedataontravelandtransportationpatternsintheUnitedStates.
The 2009 NHTS updates information gathered in the 2001 NHTS and in prior Nationwide Personal
TransportationSurveys(NPTS)conductedin1969,1977,1983,1990,and1995.
Datacollected:
TheNHTS/NPTSactsasnation’sinventoryoftraveldata.Thedataiscollectedondailytripstakenina24-
hourperiod,andincludes:
Ø Purposeofthetrip(work,shopping,etc.)
Ø Meansoftransportationused(car,bus,subway,walk,etc.)
Ø Traveltime
Ø Timeofdaywhenthetriptookplace
Ø Dayofweekwhenthetriptookplace
Foraprivatevehicletrip,italsocaptures
Ø numberofpeopleinthevehicle,i.e.,vehicleoccupancy
Ø Drivercharacteristics(age,sex,workerstatus,educationlevel,etc.)
Ø Vehicleattributes(make,model,modelyear,amountofmilesdriveninayear)
Scope–WhattheNHTSIncludes
Ø Householddataonmembers,education,income
Ø Housingcharacteristics,informationoneachvehicle,includingyear,make,model,andestimates
ofannualmilestraveled
Ø Informationontravelaspartofwork;Dataaboutone-waytripstakenduringadesignated24-hour
period
Ø Informationtodescribecharacteristicsofgeographicalarea
Scope—WhatIsNotIncluded
Ø Costsoftravel
Ø Specifictravelroutesortypesofroadsused
P a g e 14|52
Ø Travelofthesampledhouseholdchangesovertime
Ø Informationthatwouldidentifytheexacthouseholdorworkplacelocation
APCAVL-Automaticpassengercounterandautomaticvehiclelocationdata
Location:Pittsburgh
Time:2012,2013,2014
Dataused:2013/2014
AVLAutomaticVehicleLocation(AVL)Systems,isapartofITS(intelligenttransportationsystem),
whichhasbeenadoptedbymanytransitagenciestotracktheirtransitvehiclesinrealtime.Itisessentially
avehicletrackingsystemthatcombinestheuseofautomaticvehiclelocationgatheredbyGPSattachedto
individualvehicleswithsoftwarethatcollectsafleetofdataforacomprehensivepicture.Thedatacollected
includesvehiclelocations,howlongtheytravel,wheretheystop,howlongittakestomovefromonestop
to another and the distance between the stops. Urban public transit authorities are an increasingly
commonuserofvehicletrackingsystems,particularlyinlargecitieslikePittsburgh.
AutomatedPassengerCounter (APC) on theotherhand is a device installedon transit vehicles,whichaccurately records boarding and alighting data. This device accurately tracks transit ridership when
compared to the traditionalmethods of accounting by drivers or estimation through surveying. These
devicesarebecomingmorecommonamongAmericantransitoperatorsseekingtoimprovetheaccuracy
ofreportingpatronageaswellasanalyzingtransitusepatternsbylinkingboardingandalightingdatawith
stoplocation.
AschematicdiagramforAVLAPCsystemisgivenbelow.
AUTOMATICVEHICLELOCATION
References:ntl.bts.gov
P a g e 15|52
AUTOMATICPASSENGERCOUNTER
References:Pinterest
Figure2Section1
AsampleofhowAPCAVLsystemcollectsdataisgivenbelowalongwiththevariabledescription.
P a g e 2|52
P a g e 1|52
Figure3section1
Variabledescription
Figure4section1
PittsburghParkingTerminals
Location:Pittsburgh
Time:2013and2014
Thedata(showninthesnapshotbelow)isforPittsburghdowntownparkingmeteroccupancycontainsall
DowntownPittsburghparkingterminalsidentifiedby"terminalID".Thedatafilegiventotheteamwasin
geoJSONformatandPPA_terminalscontainsparkingoccupancydataforentirePittsburgh
PittsburghparkingterminalsdatafileingeoJSONformat
Figure5section1
PittsburghParkingterminalconvertedtotableformat
P a g e 2|52
Figure6section1
INRIX
INRIXisaglobalSaaSandDaaScompany,whichprovidesavarietyofInternetservicesandmobile
applications pertaining to road traffic and driver services. INRIX provides historical, real-time traffic
information,trafficforecasts,traveltimes,traveltimepolygonsandtrafficcount.(References:Inrixwebsite
andWiki)
Wehadaccessto informationforPittsburghandall fields for INRIXdataforNorthbound,southbound,
eastbound,westbound,clockwiseandcounterclockwiseroadsinAlleghenycounty,PA(1930tmcs).The
variabledescriptionandsnapshotisprovidedinnextpage.
P a g e 3|52
SnapshotofINRIXdata
Figure7section1
P a g e 4|52
AppendixIII:DataCleaningSteps
TOOLSUSED:
Ø RStudio
Ø Weka
Ø Excel
APCAVL:Automaticpassengercounterandautomaticvehiclelocationdata
Availabledata:September2012,2013and2014.(RefertoFigure3and4inSection1).Thedatasetclearly
containsmanyvariables.However,lookingatitcloselytheteamrealizedthatitdoesnotcontainthemost
importantvariablesthatidentifythelocationofthebusstop,whichisthegeographiccoordinates.
Geographiccoordinatesactasthemissinglinkbetweenfindingthetractsthroughwhichthebus
transits. In order to find this data, we used the Pittsburgh GIS data from the Pittsburgh City website.
http://pittsburghpa.gov/dcp/gis/gis-data
Keyused:BusStopID
Thefollowingfileswereusedtoobtainthemissinginformationandcombinethedatasets
Ø CensusTracts2010:ContainsalldetailsaboutcensustractsinPittsburgh
Ø Neighborhoods:ContainsdetailsaboutallneighborhoodsinPittsburgh
Ø CensusBlocks2010:ContainsdetailsaboutallblocksandblockgroupsinPittsburgh
Ø Port Authority Bus Stops: Contains details about all bus stops and their respective stop IDs in
Pittsburgh
PittsburghDowntownparkingterminals
Available data: September 2013 and 2014. Refer to figure 6 section 1. Missing information about
neighborhoodandtracts.Datawasataverygranularlevel(TerminalIDlevelthatisageographicpoint).
Therefore,itneededtobescaleduptotractlevel.WeusedtheterminalIDscoordinateswhichactedas
themissinglinkbetweenfindingthetracts.Inordertofindthisdata,weusedthePittsburghGISdatafrom
thePittsburghcitywebsite.Keyusedtojointhedatasetiscoordinates.
http://pittsburghpa.gov/dcp/gis/gis-data
Thefollowingfileswereusedtoobtainthemissinginformationandcombinethedatasets
Ø CensusTracts2010:ContainsalldetailsaboutcensustractsinPittsburgh
Ø Neighborhoods:ContainsdetailsaboutallneighborhoodsinPittsburgh
Ø CensusBlocks2010:ContainsdetailsaboutallblocksandblockgroupsinPittsburgh
P a g e 5|52
Afterobtainingthedata,weidentified138majortractsinPittsburgh,isolatedtheparkingdataforallthese
138tracts,andaggregatedthenumberofparkingtransactionsineachtract.Thisdatawasveryrelevantin
creatingtheregressionmodelthatwillbediscussedlater.
INRIX
Availabledataset:September2011and2013.Refer(figure7section1).Thisdatasetconsistsofvehicle
speedingdatafordifferentTMCroadsegmentsanditsrespectivecoordinates.Again,themissingdatais
tractFIPScode.Weusedthecoordinateswhichactedasthemissing linkbetweenfindingthetracts. In
OrdertofindthisdataweusedthePittsburghGISdatafromthePittsburghcitywebsite.Keyusedtojoinis
coordinates.
http://pittsburghpa.gov/dcp/gis/gis-data
Thefollowingfileswereusedtoobtainthemissinginformationandcombinethedatasets
Ø CensusTracts2010:ContainsalldetailsaboutcensustractsinPittsburgh
Ø Neighborhoods:ContainsdetailsaboutallneighborhoodsinPittsburgh
Ø CensusBlocks2010:ContainsdetailsaboutallblocksandblockgroupsinPittsburgh
P a g e 6|52
AppendixIV:Models
LogisticRegression
Itisastatisticalmethodforanalyzingadatasetinwhichthereareoneormoreindependentvariablesthat
determineanoutcome.Theoutcomeismeasuredwithadichotomousvariable(inwhichthereareonly
twopossibleoutcomes).
AppliedinNHTSmodeltopredictmodeoftransporttakenbycommutertoreachworklocation.
MultinomialLogisticRegression
Itisthelinearregressionanalysistoconductwhenthedependentvariableisnominalwithmorethantwo
levels.Thus,itisanextensionoflogisticregression,whichanalyzesdichotomous(binary)dependents.
AppliedinACSmodeltopredictmodeoftransporttakenbycommutertoreachworklocation.
ArithmeticMean
Itissimply"mean")ofasample ,usuallydenotedby ,isthesumofthesampledvalues
dividedbythenumberofitemsinthesample:
SimpleLinearRegression
Itistheleastsquaresestimatorofalinearregressionmodelwithasingleexplanatoryvariable,fitsastraight
linethroughthesetofnpointsinsuchawaythatmakesthesumofsquaredresidualsofthemodel(that
is,verticaldistancesbetweenthepointsofthedatasetandthefittedline)assmallaspossible.
Theoutcomevariableisrelatedtoasinglepredictor.Theslopeofthefittedlineisequaltothecorrelation
betweenyandxcorrectedbytheratioofstandarddeviationsofthesevariables.Theinterceptofthefittedlineissuchthatitpassesthroughthecenterofmass(x,y)ofthedatapoints.
Appliedinmodelusedtopredicttimetakenbyacommutertoreachhisworklocation.
P a g e 7|52
AppendixV:Code
Codeusedtopredictmodeoftransport:Pittsburgh_Mode_Choice.R
(ApplicableforboththestepsofTask1)
library("nnet")
#Functionforimportcsvfile
import.csv<-function(filename){
return(read.csv(filename,sep=",",header=TRUE))
}
#Importdataset
pa<-import.csv('mergedPUMS2013.csv')
#ExtractonlyPittsburghCitydatabyPUMAcode
pit<-subset(pa,PUMA==01701|PUMA==01702)
#Createanemptydataframeforlateruse
final<-data.frame()
#RandomSampling
for(iin1:1000){
#Shuffle
pit<-pit[sample(nrow(pit)),]
#20%
#pit.20<-pit[1:189,]
#40%
#pit.40<-pit[1:377,]
#60%
#pit.60<-pit[1:566,]
#80%
#pit.80<-pit[1:754,]
#100%
P a g e 8|52
pit.100<-pit
#Combinetheresults
final<-rbind(final,results(pit.100))
}
#Replacemissingvalues
final[is.na(final)]<-0
#Calulatethemeanaccuracyasfinalresults
onethousandtimes.hundredpercent<-data.frame(Car=mean(final$Car),
Bus=mean(final$Bus),
StreetCar=mean(final$Streetcar),
Subway=mean(final$Subway),
Railroad=mean(final$Railroad),
Ferryboat=mean(final$Ferryboat),
Taxicab=mean(final$Taxicab),
Motorcycle=mean(final$Motorcycle),
Bicycle=mean(final$Bicycle),
Walked=mean(final$Walked),
Worked.at.home=mean(final$Worked.at.home),
Other.method=mean(final$Other.method)
)
#Functionforthemodel
results<-function(df){
#Variables
variables <-
c("JWMNP","FMRGIP","MIG","PINCP","COW","JWRIP","INTP","FLANXP","PWGTP","SPORDER","PERNP","HI
NS2","TYPE","FS","RACASN","RNTM","JWTR")
#Subsetthedatasetbyprovidedvariables
df<-df[variables]
#Dataprocessing
P a g e 9|52
df<-subset(df,JWTR!="")
df[is.na(df)]<-0
df$FMRGIP<-as.factor(df$FMRGIP)
df$MIG<-as.factor(df$MIG)
df$COW<-as.factor(df$COW)
df$JWRIP<-as.factor(df$JWRIP)
df$FLANXP<-as.factor(df$FLANXP)
df$HINS2<-as.factor(df$HINS2)
df$TYPE<-as.factor(df$TYPE)
df$FS<-as.factor(df$FS)
df$RACASN<-as.factor(df$RACASN)
df$RNTM<-as.factor(df$RNTM)
df$JWTR<-as.factor(df$JWTR)
private<-subset(df,JWTR=="1")
public<-subset(df,JWTR!="1")
#Functionfor10folderscrossvalidation
smoothCV<-function(x,y,K=10){
result<-data.frame()
x<-c(x)
y<-c(y)
df<-data.frame(x,y)
df<-df[sample(nrow(df)),]
folds=cut(1:nrow(df),breaks=K,labels=FALSE)
for(jin1:K){
train<-df[folds!=j,]
test<-df[folds==j,]
fit<-multinom(JWTR~.,data=train,MaxNWts=8268)
predicted2<-predict(fit,test,type="class")
P a g e 10|52
result<-rbind(result,data.frame(pred=predicted2,true=test$JWTR))
}
return(result)
}
#Runcrossvalidation
x<-df[,names(df)!="JWTR"]
y<-data.frame(df$JWTR)
colnames(y)<-c("JWTR")
result<-smoothCV(x,y,10)
#Createvariablesforlateruse
Car<-0
Bus<-0
Streetcar<-0
Subway<-0
Railroad<-0
Ferryboat<-0
Taxicab<-0
Motorcycle<-0
Bicycle<-0
Walked<-0
Worked.at.home<-0
Other.method<-0
#Accumulatetheoccurrencesofeachtransportationtool
for(iin1:nrow(result)){
if(result$true[i]=="1"&&result$pred[i]=="1"){
Car<-Car+1
}elseif(result$true[i]=="2"&&result$pred[i]=="2"){
Bus<-Bus+1
}elseif(result$true[i]=="3"&&result$pred[i]=="3"){
P a g e 11|52
Streetcar<-Streetcar+1
}elseif(result$true[i]=="4"&&result$pred[i]=="4"){
Subway<-Subway+1
}elseif(result$true[i]=="5"&&result$pred[i]=="5"){
Railroad<-Railroad+1
}elseif(result$true[i]=="6"&&result$pred[i]=="6"){
Ferryboat<-Ferryboat+1
}elseif(result$true[i]=="7"&&result$pred[i]=="7"){
Taxicab<-Taxicab+1
}elseif(result$true[i]=="8"&&result$pred[i]=="8"){
Motorcycle<-Motorcycle+1
}elseif(result$true[i]=="9"&&result$pred[i]=="9"){
Bicycle<-Bicycle+1
}elseif(result$true[i]=="10"&&result$pred[i]=="10"){
Walked<-Walked+1
}elseif(result$true[i]=="11"&&result$pred[i]=="11"){
Worked.at.home<-Worked.at.home+1
}elseif(result$true[i]=="12"&&result$pred[i]=="12"){
Other.method<-Other.method+1
}
}
#Calculatetheaccuracy
car<-subset(df,JWTR=="1")
car.accurary<-Car/nrow(car)
bus<-subset(df,JWTR=="2")
bus.accurary<-Bus/nrow(bus)
streetcar<-subset(df,JWTR=="3")
streetcar.accurary<-Streetcar/nrow(streetcar)
subway<-subset(df,JWTR=="4")
P a g e 12|52
subway.accurary<-Subway/nrow(subway)
railroad<-subset(df,JWTR=="5")
railroad.accurary<-Railroad/nrow(railroad)
ferryboat<-subset(df,JWTR=="6")
ferryboat.accurary<-Ferryboat/nrow(ferryboat)
taxicab<-subset(df,JWTR=="7")
taxicab.accurary<-Taxicab/nrow(taxicab)
motorcycle<-subset(df,JWTR=="8")
motorcycle.accurary<-Motorcycle/nrow(motorcycle)
bicycle<-subset(df,JWTR=="9")
bicycle.accurary<-Bicycle/nrow(bicycle)
walked<-subset(df,JWTR=="10")
walked.accurary<-Walked/nrow(walked)
worked.at.home<-subset(df,JWTR=="11")
worked.at.home.accurary<-Worked.at.home/nrow(worked.at.home)
other.method<-subset(df,JWTR=="12")
other.method.accurary<-Other.method/nrow(other.method)
accu<-data.frame(Car=car.accurary,
Bus=bus.accurary,
Streetcar=streetcar.accurary,
Subway=subway.accurary,
Railroad=railroad.accurary,
Ferryboat=ferryboat.accurary,
Taxicab=taxicab.accurary,
Motorcycle=motorcycle.accurary,
Bicycle=bicycle.accurary,
Walked=walked.accurary,
Worked.at.home=worked.at.home.accurary,
Other.method=other.method.accurary
P a g e 13|52
)
return(accu)
}
CodeusedtogettheFIPScodeofstopfromtheCensusAPI:getStopFIPS.py
importre
importrequests
frombs4importBeautifulSoup,Tag
importcsv
defmain():
a=1
map={}
withopen('stopIDFIPS','w',newline='')asfp:
a=csv.writer(fp,delimiter=',')
title=["stopID","FIPS"]
a.writerow(title)
withopen('PAAC_Stops_1603_Public2.csv','r')ascsvfile:
readCSV=csv.reader(csvfile,delimiter=',')
forrowinreadCSV:
ifa==1:
a=2
continue
all=row[4]+"+"+row[5]
value=map.get(all)
ifvalue==None:
url="http://data.fcc.gov/api/block/find?format=json&latitude="+row[4]+"&longitude="+row[5]+"
&showall=true"
html=requests.get(url).content
htmltxt=BeautifulSoup(html,'html.parser')
matchobj=re.findall(r'{"Block":{"FIPS":"(\d+)"},"County":',str(htmltxt))
iflen(matchobj)>0:
map[all]=matchobj[0]
value=matchobj[0]
ifvalue!=None:
data=[row[0],value]
a.writerow(data)
main()
P a g e 14|52
Codeusedtofiltermeantraveltimefromtractleveltocountylevel:filter.py
importre
importrequests
frombs4importBeautifulSoup,Tag
importcsv
defmain():
a=1
begin=set()
map={}
PUMAMap={}
withopen('filteredMeanTT.csv','w',newline='')asfp:
a=csv.writer(fp,delimiter=',')
title=["PUMAtract","meanTT"]
a.writerow(title)
withopen('NeighborhoodPuma.csv','r')ascsvfile:
readCSV=csv.reader(csvfile,delimiter=',')
forrowinreadCSV:
ifa==1:
a=2
continue
puma=row[0]
tract=row[2][0:11]
begin.add(tract)
PUMAMap[tract]=puma
withopen('FIPSMeanTT.csv','r')asinput:
readInput=csv.reader(input,delimiter=',')
forrowinreadInput:
try:
FIPS=row[0]
time=float(row[1])
FIPSstart=FIPS[0:11]
except:
continue
ifFIPSstartinbegin:
pumaValue=PUMAMap.get(FIPSstart)
value=map.get(pumaValue)
ifvalue==None:
map[pumaValue]=[time,1]
else:
map[pumaValue]=[value[0]+time,value[1]+1]
forkey,valueinmap.items():
meanTT=value[0]/value[1]
data=[key,meanTT]
a.writerow(data)
P a g e 15|52
main()
Codeusedtocalculatethemeantime:calculateMeanTT.py
importre
importrequests
importcsv
fromosimportlistdir
fromos.pathimportisfile,join
fromcollectionsimportdeque
defmain():
countmiss=0
countsucc=0
FIPSMap={}
trackMap={}
d=deque()
countMissOff=0
withopen('FIPSMeanTT.csv','w',newline='')asfp:
a=csv.writer(fp,delimiter=',')
title=["FIPS","meanTT"]
a.writerow(title)
withopen('stopIDFIPS.csv','r',newline='')ascsvfile1:
readCSV1=csv.reader(csvfile1,delimiter=',')
forrowinreadCSV1:
FIPSMap[row[0]]=row[1]
mypath='/Users/Yun/Documents/capstone/findTrackPublicMeanTT/StopInfo'
onlyfiles=[fforfinlistdir(mypath)ifisfile(join(mypath,f))]
forfileinonlyfiles:
iffile.endswith(".csv"):
withopen(mypath+"/"+file,'r')ascsvfile:
readCSV=csv.reader(csvfile,delimiter=',')
forrowinreadCSV:
iflen(row)>20:
ifrow[0]!='DOW':
try:
stopID=row[8].strip()
on=int(row[16])
off=int(row[17])
min=float(row[20])
exceptException:
continue
FIPS=FIPSMap.get(stopID)
P a g e 16|52
ifFIPS==None:
countmiss=countmiss+1
continue
else:
countsucc=countsucc+1
foriind:
trackInfo=trackMap.get(i)
iftrackInfo==None:
trackMap[i]=[min,0]
else:
iflen(trackInfo)==2:
trackMap[i]=[trackInfo[0]+min,trackInfo[1]]
trackInfoCur=trackMap.get(FIPS)
iftrackInfoCur==None:
trackMap[FIPS]=[0,0]
trackInfoCur=[0,0]
trackMap[FIPS]=[trackInfoCur[0],trackInfoCur[1]+on]
whileon>0:
d.append(FIPS)
on=on-1
whileoff>0:
iflen(d)==0:
countMissOff=countMissOff+off
break
else:
d.popleft()
off=off-1
whilecountMissOff>0:
iflen(d)>0:
d.popleft()
countMissOff=countMissOff-1
else:
break
forkey,valueintrackMap.items():
ifvalue[1]==0:
continue
else:
meanTT=value[0]/value[1]
print(key,value)
data=[key,meanTT]
a.writerow(data)
print(countmiss,countsucc)
main()
P a g e 17|52
Codeusedtopredictthetravel-time:travel_time.R
Lm (formula = X$Avg_TT ~X$Public_MTT + X$Parking_Transactions + X$Population_Total +
X$Means_of_Transport_Total,data=X)
P a g e 18|52
References
Ø NationalHouseholdTravelSurveyDataAnalysis:Ø U.S.CensusBureauprojectionsto2060methodology:
http://www.census.gov/population/projections/data/national/2014.htmlØ U.S.CensusBureauprojectionsto2060summarytables:
http://www.census.gov/population/projections/data/national/2014/summarytables.htmlØ SummaryreportabouttheU.S.CensusBureauprojectionsto2060:
http://www.census.gov/content/dam/Census/library/publications/2015/demo/p25-1143.pdfØ ACSLongitudinalEmployer-HouseholdDynamics:Job-to-JobFlows(J2J)Data(Beta)DataFiles:
http://lehd.ces.census.gov/data/j2j_beta.htmlØ ACSLongitudinalEmployer-HouseholdDynamics:Job-to-JobFlows(J2J)Data(Beta)Data
dictionary:http://lehd.ces.census.gov/data/schema/V4.1c-draft/lehd_public_use_schema.html
Ø MeasuringCommutingintheAmericanTimeUseSurvey:http://bae.uncg.edu/econ/files/2015/02/Kimbrough-working-paper-15-02.pdf
Ø OhioTIMES:transportationinformationmappingsystemhttp://gis.dot.state.oh.us/tims/Ø 2011OregonHouseholdActivitySurvey:
http://www.oregon.gov/ODOT/TD/TP/Pages/Data.aspxØ TheImpactsofSocio-EconomicandDemographicShiftsinTransitServedNeighborhoodsOn
ModeChoiceAndEquitybyStevenApellØ DemographicTrendsTheEffectsofDemographicChangeonSelectedTransportationServicesand
DemandbyMurdockSH,ClineME,ZeyM,PerezD,&JeantyPWØ ModeChoicesofMillennials:HowDifferent?HowEnduring?byRobertB.Case,PE,PhD,Seth
SchipinskiØ TravelTimeUseOverFiveDecadesbyChenSong&ChaoWeiØ NortheastTravelChoiceSurvey(NTCS)UniversityofVermont’sTransportationResearchCenter
(datanotpublic)Ø 2015UrbanMobilityScorecardpublishedjointlybytheTexasA&MTransportationInstituteand
INRIXØ INRIXInterfaceGuideDecember2014publishedbyINRIXØ ImpactofTrafficCongestiononBusTravelTimeinNorthernNewJerseybyClaireE.McKnight,
HerbertS.Levinson,KaanOzbay,CamilleKamga,andRobertE.PaaswellØ CommutinginAmericaReportIII-TheThirdNationalReportonCommutingPatternsandTrends
byAlanE.PisarskiØ RStudioTool:https://www.rstudio.com/products/rstudio/download/Ø WekaTool:https://weka.waikato.ac.nz/explorer Ø NationalHouseholdTravelSurveyData:Ø INRIXdata:Heinz College,Carnegie Mellon University Research CentreØ PittsburghDowntownParkingTerminals:Heinz College, Carnegie Mellon University
Research CentreØ Automaticpassengercounterandautomaticvehiclelocationdata:Heinz College, Carnegie
Mellon University Research Centre
P a g e 19|52
Ø PittsburghGISinformation:http://pittsburghpa.gov/dcp/gis/gis-dataØ DescriptionofStatisticalModels:Wikipedia and www.statisticssolutions.comØ GoodnessofFitexplanationforstatisticalmodel:
http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit