capstone project - u.s. census bureau, center for adaptive...

CAD11-30-15

CapstoneProject-U.S.CensusBureau,CenterforAdaptiveDesignTitleoftheProject:CombiningTraditionalandNewDataSourcesintheProductionofOfficialStatisticsTiming:Spring2016Client:TheCenterforAdaptiveDesignattheU.S.CensusBureauIssueDefinition:TheU.S.CensusBureauisthelargeststatisticalagencyintheU.S.Government.Inadditiontotheonce-a-decadepopulationcount,theCensusBureaucollectsdataontheU.S.economy,healthcare,poverty,education,crime,andbusinessinnovation,amongothers.ThepredominantmethodusedforcollectingdatabytheCensusBureauisthetraditionalsamplesurveyadministeredtohouseholds,businessesandgovernmententities.Responseratestosamplesurveys,however,aredecliningrapidlywhilesurveycostsareclimbingatanunsustainablerate.Further,theCensusBureauishighlyinterestedinreducingpublicburden(theeffortandtimespentrespondingtosurveys).ThiscombinationofchallengesisdrivingtheCensusBureautoexplorealternativedatacollectionmethods,includingpassiveapproachesusingBigDatatechniques.Theoverallissuethisprojectwillhelpaddressishowtocombinethedatafromnewsourceswithdatacollectedusingthetraditionalapproaches.Thegoalistoincreaseefficiencyandtimeliness,reducecostandburden,andmaintaindataquality.ScopeofWork:Aparticularlineofsurveyquestionsthathaveproventobeproblematicarethosesurroundingthetopicofpeople’stravelandcommutinghabits.Withtheavailabilityoftrafficandcommutingdatafrompublicandprivatesources,aswellascrowd-sourcedapplications(suchasWaze)theCensusBureaushouldbeabletoaccessdatathatmayenableareductionintravelandcommutingquestionsinoursamplesurveys(suchastheAmericanCommunitySurvey).ThisprojectwouldresearchpossibledatasourcestoinferselectedtravelandcommutingstatisticsintheAmericanCommunitySurvey,suggestmethodsandtacticsforaccessingandinterpretingthosesources,andexploreapproachesforverifyingthequalityofthesesourceswiththegoalovertimeofincorporatingthemintoofficialstatistics.Possibledatasourcesincluderoadwaytrafficcounts,traveltime,parking,publictransportation,etc.Workalsoinvolvesdevisingmethodstotestnewdatasourcequalityagainstestablishedsurveydatabenchmarks,aswellassuggestingaviablelong-termplanfortransitioningtonewdatasources.ExpectedDeliverables:

• Researchandrecommendationsonpossiblenewdatasourcesto

complement/replacetargetedquestionsonexistingsamplesurveyssuchastheAmericanCommunitySurvey

CAD11-30-15

• Literaturereviewofexistingworkonincorporatingnewdatasourcesintotravelsurvey

• Documentationofmethods,technology,cost-benefits,andtacticsforaccessingandinterpretingnewdatasources

• Suggestedapproachesforverifyingthequalityofnewdatasources• Documentationofapossiblelong-termplanfortransitioningtonewdatasources

SkillsRequired:DataAnalytics/DataScienceSurveyMethodologyInformationTechnology/SystemDevelopmentAdvisoryBoard:MichaelThieme,ChiefoftheCenterforAdaptiveDesign,U.S.CensusBureauBenjaminReist,AssistantCenterChief,U.S.CensusBureau,CenterforAdaptiveDesign,U.S.CensusBureauStephanieCoffey,MathematicalStatistician,CenterforAdaptiveDesign,U.S.CensusBureauDeadline:12/5/2015FacultyAdvisor:SeanQianProposercontactinfo:Name,email,phone#Clientpointofcontact:[email protected](301)763-9062

CapstoneReport

Spring2016

Advisor

SeanQian

Team

ManikandanPalani|NishaRao|SahilAggarwal|YifeiJiang|YunFu

P a g e 1|52

TableofContents

ExecutiveSummary......................................................................................................................................3

Introduction.................................................................................................................................................3

Background..............................................................................................................................................3

Goal..........................................................................................................................................................4

ProjectMethodology...................................................................................................................................4

Task1:ACSquestion“Howdidthispersonusuallygettoworklastweek?”...........................................5

Task2:ACSquestion“Howmanyminutesdiditusuallytakethispersontogetfromhometoworklastweek?”...................................................................................................................................................10

LessonsLearned...........................................................................................................................................7

SuggestionsforFutureWork........................................................................................................................8

Acknowledgement.......................................................................................................................................8

AppendixI:LiteratureReview......................................................................................................................8

NationalHighwayTravelSurveyDataAnalysis........................................................................................8

FactorsaffectingtraveltimeinUS...........................................................................................................9

BusTransitTime.....................................................................................................................................10

INRIXTrafficData...................................................................................................................................11

UrbanCongestionLevel.........................................................................................................................11

AppendixII:InformationonExternalDataSources...................................................................................13

NHTS:NationalHouseholdtravelsurvey...............................................................................................13

APCAVL-Automaticpassengercounterandautomaticvehiclelocationdata.....................................14

PittsburghParkingTerminals...................................................................................................................1

INRIX.........................................................................................................................................................2

AppendixIII:DataCleaningSteps.................................................................................................................4

APCAVL:Automaticpassengercounterandautomaticvehiclelocationdata........................................4

PittsburghDowntownparkingterminals.................................................................................................4

INRIX.........................................................................................................................................................5

AppendixIV:Models....................................................................................................................................6

LogisticRegression...................................................................................................................................6

MultinomialLogisticRegression...............................................................................................................6

ArithmeticMean......................................................................................................................................6

SimpleLinearRegression.........................................................................................................................6

P a g e 2|52

AppendixV:Code.........................................................................................................................................7

Codeusedtopredictmodeoftransport:Pittsburgh_Mode_Choice.R....................................................7

CodeusedtogettheFIPScodeofstopfromtheCensusAPI:getStopFIPS.py......................................13

Codeusedtofiltermeantraveltimefromtractleveltocountylevel:filter.py....................................14

Codeusedtocalculatethemeantime:calculateMeanTT.py................................................................15

Codeusedtopredictthetravel-time:travel_time.R.............................................................................17

References..................................................................................................................................................18

P a g e 3|52

ExecutiveSummary

This report is oneof the deliverables associatedwith theHeinz CapstoneProject programat Carnegie

MellonUniversity.Inthisprogram,studentsaregroupedinteamsoffiveandassignedaclientforwhom

theywillundertakeaninitiativeoverthecourseofasemester.

OurclientforthisprojectistheCentreforAdaptiveDesignoftheUnitedStatesCensusBureau.TheU.S

CensusBureauistheprincipleU.SFederalStatisticalsystemintheU.S.

OurgoalwastohelpreducethecostofconductingtheAmericanCommunitySurvey.Thisreportisdivided

intoseveralsectionsthatdetailthestepstakenbytheteamtodevelopasolutionthathelpsachievethe

goal.Firstly,itprovidesabriefbackgroundontheteam’objective.Secondly,itdetailsthemodelbuiltby

CMU Capstone team to estimate response to two of the fivework-commute related questions in the

American Community Survey (ACS) using historical ACS data and data from external sources – INRIX,

PittsburghParking.Theteamcomparedtheaccuracyoftheresultsofthemodelusinga20%,40%,60%

and80%samplingplan toobserve theutilityof themodel in scenarioswhereonly20%or40%of the

existingrespondentsweretobecontacted.ThemodelcouldhelptheBureautoreducethenumberof

peoplesurveyedyetobtainreliablyaccurateresponsesfortherestoftheintendedtargetpopulationand

subsequentlyreducesurveycosts.Thirdly,thereportdetailsthefutureworkofincludingfactorssuchas

congestionandweatherthatcouldhelpfurtherrefinethemodel.

Introduction

Background

US Census Bureau conducts the American Community Survey (ACS) on an ongoing basis to collect

information about the American people and their community. This data is used by planners and

entrepreneurstoassessthepastandplanthefuture.Italsohelpsdeterminethedistributionofmorethan

400$billioninfederalandstatefundseachyear.

Thesurveyisadministeredthroughseveralwaystooneinthirty-eightU.Shouseholdsperyear.Theform

can be completed online or on a paper form.U.S Census Bureau has a follow-up procedure in case a

respondentdoesnotrespondwithinastipulatedtimeframe.Thenon-respondenteitherreceivesacallor

apersonalvisitfromtheCensusstafftohelpcompletethesurvey.

Thecostofconductingthesurveyisincreasingwhilethesurveyresponserateisdeclining.ACScurrently

has seventy-seven questions. To reduce respondent burden, Centre for Adaptive Design of US Census

Bureauisresearchingwaystoreducesurveycost.

Ofitsseventy-sevenquestions,ACShasfivequestionsthatcaptureinformationaboutwork-commuteof

therespondents.Thesequestionshaveprovedtobeveryproblematicwhenitcomestoelicitingresponses.

P a g e 4|52

Goal

UndertheguidanceofSeanQian,CMUProfessor,ourgoalofthisprojectwastoaccesswaystoinferthe

travelandcommuterelatedquestionsinACS.

ProjectMethodology

Ofthefivework-commuterelatedquestionsinACS,theteamconcentratedontwoquestions-“Howdid

thispersonusuallygettoworklastweek?”and“Howmanyminutesdiditusuallytakethispersontoget

from home to work last week?”. The objective was to build models to estimate responses to these

questions.

So,theprojectmethodologyhasbeendetailedwithrespecttothetwotasks.

Task1:ACSquestion-“Howdidthispersonusuallygettoworklastweek?”

Summary: The team through its initial research (Appendix I: NHTS Data Analysis) identified the

methodologytodevelopamodeltopredictthemodeoftransportaspublicorprivate.Theexplanatory

variablesforthemodelwereinitiallyidentifiedfrombothNHTSandACSdatasources.

NHTSdataisdetailedandcananswermanyquestionsdirectlybutitslatestdataisfromyear2009.

NHTSisnotconductedasfrequentlyasACS.Duetothisfact,theutilityofthemodelwaslimited.More

detailsareprovidedin"StepI:PredictivemodelusingNationalHouseholdTravelSurveydata"

TheteamthenidentifiedequivalentofNHTSvariablesinACSdata.Theteammembersdeveloped

amodeltopredictthemodeoftransportaspublicorprivateusingsolelythevariablesfromACSdata.They

refinedthemodelfurthertobeabletoidentifythetypeofpublicorprivatemodeoftransporttakenbya

respondent.Theutilityofthismodelwastestedusing20%,40%,60%and80%randomsamplingplan.More

detailsareprovidedin"StepII:PredictivemodelusingPUMSdataofAmericanCommunitySurvey"

Task2:ACSquestion-“Howmanyminutesdiditusuallytakethispersontogetfromhometoworklast

week?”

Summary:UsingtheCommutinginAmericaIIIreport(AppendixII:FactorsaffectingtraveltimeinUS),the

team identified intuitively the variables that could affect the work-commute travel time. The team

membersalsoresearched(AppendixIII:BusTransitTimeandAppendixIV:INRIXTrafficData)onhowto

makeuseoftheexternaldatasourcessuchasBustransitdataandINRIXtrafficspeeddatainthemodelto

predicttravel-time.

The team faced challengeswith respect to data granularity and overcame themby using data

cleaningmethodsandaggregatingthedataattheappropriatelevel.Theteamalsohadtomakequiteafew

assumptionsinordertomakeuseoftheexternaldatasets.Themembersdevelopedamethodtocalculate

meantraveltimeforbustransitatblocklevel.Moredetailsareprovidedin"StepI:Meantraveltimefor

Publictransportation-Businagivenblock".

Thedetailsof the travel-timepredictingmodel areprovided in "Step II: Predictivemodelusing

historicaldatafromACSalongwithinputsfromexternaldataset".

P a g e 5|52

Theteamalsodidsomeresearchoncalculatingthecongestionlevelinurbanareas(AppendixV:

Urban Congestion Level). As per the “Commuting in America III report”, travel time is an attribute of

commuters whereas congestion is an attribute of facilities. Therefore, measures of travel times and

measuresofcongestiondonotnecessarilyconverge.Hencethecongestionfactorhasnotbeendelvedinto

detailandcurrentlynotbeenincorporatedinthetravel-timemodel.

Thedetailsofeachstepineachtaskaredescribedbelow.

Task1:ACSquestion“Howdidthispersonusuallygettoworklastweek?”

ResponseoptionsinACS:

• Car,truck,orvan

• Busortrolleybus

• Streetcarortrolleycar

• Subwayorelevated

• Railroad

• Ferryboat

• Taxicab

• Motorcycle

• Bicycle

• Walked

• Workedathome

• Othermethod

StepI:PredictivemodelusingNationalHouseholdTravelSurveydata

Summary:Thismodelwasbuilttoinitiallyclassifythepopulationintermsofwhetheranindividualuseda

publicoraprivatemodeoftransporttocommutetowork,andoncethatmodelshowedpromise,todig

deeperintothepredictionbybeingabletoalsoidentifywhichparticularmodeoftransportwasused(for

e.g.car,bicycleetc.inprivatevehiclesandbus,trainetc.inpublicmodesoftransport).

Tool:RStudio

InputData:

Source:NationalHouseholdTravelSurvey

Details:

Year Level #ofvariables #ofrecords

2001 National 142 70k

2009 National 150 75k

Data Cleaning Step: Some of the input variables were non-binary categorical variables. The logistic

regression model cannot accept non-binary categorical variables. These variables were converted to

relevantdummyvariablesforuseinthepredictiveprocess.Fore.g.Thevariablefor‘Race’cantakeupto9

differentvalues.Soatleast8dummyvariableswereneededforeachvaluethatthevariablecouldassume.

Aftertakingthedummyvariablesintoaccount,wehadatotalofaround160variables.Thevariablesthat

didnotrealisticallyholdanypromisewereweededouteitherduetotheir inherentdefinitionordueto

their infrequency in the actual data. After removing the variables that had almost zero frequency of

occurrence,theshortlistedvariablestotaledinthe60-80range.

P a g e 6|52

ExplanatoryVariables:

Ø Age

Ø Sex

Ø Race

Ø Region

Ø Urban housing

status

Ø Block housing

status

Ø MSAsize

Ø Educationlevel

Ø Hispanicindicatorpopulation

density

Ø Totalincome

Ø Householdincome

Ø Renterpercentage

Ø Workerstatus

Ø Numberofvehicles

Ø Homeownership

status

Ø Placeofbirth

Selectionlogic:DetailsareprovidedinAppendixI:NHTSDataAnalysis

PredictedVariable:Modeoftransport–PublicorPrivate

Model:GLMbasedlogisticclassifier

ln #$1 − #$

= )* + ),-,,$ + )/-/,$ + ⋯+)1-1,$

whereln #$1 − #$

= logoddsofprivatemodeoftransportused

-,,$ =predictorvariablei

Code: “Details in Appendix V under section “Code used to predict mode of transport:

Pittsburgh_Mode_Choice.R”

Output:

Ø 93%predictionaccuracyfor2001NHTSdataat0.5cutoffasadecisionrule

Ø 95%predictionaccuracyfor2009NHTSdataat0.5cutoffasadecisionrule

Validation:10-foldcrossvalidationmethodwasusedtovalidatetheresults.Ithadanaverageaccuracyrate

of93%.

Challenge:

Ø Highlyskewednatureofdata-92%peopleuseprivatemodeoftransport

Ø Trade-offbetweenaccuracyandthenumberoffalsepositivesfortweakingthedecisionruleand

increasingitbeyond0.5

Theoverall rateofpublic transportuse in theNHTSsamplewasaround7%whichwasroughlyour

model’smisclassificationrate.Thishappenedduetotheinherentlyskewednatureofthedata,where~93%

ofthepopulationusedaprivateformoftransporttocommutetoworkwhereasonly7%usedapublic

modeoftransport.Furthermore,ourmodel’sROCcurveshowed26%greaterrecallthanaflipping-of-coin

basedmodel.

Conclusion: The most important and most correlated variables (with the mode of transport used to

commutetowork)areAge,Sex,EducationlevelandTotalIncome-allofwhichmadeperfectintuitivesense.

P a g e 7|52

TheNational Household Travel Survey data can be used to estimate cannot be used as an alternative

source.

StepII:PredictivemodelusingPUMSdataofAmericanCommunitySurvey

Summaryofthemodel:Thismodelwasbuilttoclassifythepopulationintermsoftheparticularmodeof

transport(fore.g.car,bicycle,train,etc.)usedtocommutetowork.

Tool:RStudio,AmazonWebService

InputData:AmericanCommunitySurvey’shistoricaldata

TheAmericanCommunitySurvey (ACS)PublicUseMicrodataSample (PUMS) filesarea setof

untabulatedrecordsaboutindividualpeopleorhousingunits.TheCensusBureauproducesthe

PUMSfilessothatdatauserscancreatecustomtablesthatarenotavailablethroughpretabulated

(orsummary)ACSdataproducts.

DataCleaningStep:

Ø TheteamextractedthePittsburghCitydatafromthewholenationalwidedatasetbasedonthe

PUMAcodeofPittsburghCity(01701and01702).

Ø Forthemissingvalues,wereplacedthemwith0or01accordingtocertaincircumstances.

ExplanatoryVariables:

Thereare512variablesinthedataset,theteamthenselectedthevariablesthatmostinfluence

whichtransportmodethepersonismostlikelytouseforbothPittsburghCityandU.S.national

wide.

VariablesforPittsburghCity:

Name Explanation

JWMNP Traveltimetowork

FMRGIP Firstmortgagepaymentincludesfire,hazard,floodinsuranceallocationflag

MIG Mobilitystatus

PINCP Totalperson'sincome

COW Classofworker

JWRIP Vehicleoccupancy

INTP Interest,dividends,andnetrentalincomepast12months

FLANXP LanguageotherthanEnglishallocationflag

PWGTP Person'sweight

SPORDER PersonNumber

PERNP Totalperson'searnings

HINS2 Insurancepurchaseddirectlyfromaninsurancecompany

P a g e 8|52

TYPE Typeofunit

FS Yearlyfoodstamprecipiency

RACASN Asianrecode(Asianaloneorincombinationwithoneormoreotherraces)

RNTM Mealsincludedinrent

JWTR(Target) Meansoftransportationtowork

VariablesforU.S.NationalWide:

Name Explanation

JWMNP Traveltimetowork

POWSP Placeofwork-Stateorforeigncountryrecode

HINS7 IndianHealthService

Wgtp HousingWeightreplicate3

REFR Refrigerator

RACBLK BlackorAfricanAmericanrecode(Blackaloneorincombinationwithoneormoreother

races)

MHP Mobilehomecosts(yearlyamount)

SSP SocialSecurityincomepast12months

PUMA Publicusemicrodataareacode(PUMA)

OCCP Occupationrecode

POWPUMA PlaceofworkPUMA

FINDP Industryallocationflag

SMOCP Selectedmonthlyownercosts

HINCP Householdincome(past12months)

OCPIP Selectedmonthlyownercostsasapercentageofhouseholdincomeduringthepast12

months

JWTR Meansoftransportationtowork

SelectionLogic:Sameastheonementionedintheprevioussection.

PredictedVariable:Modeoftransport–Car,Bus,Streetcar,Subway,Railroad,Ferryboat,Taxicab,

Motorcycle,Bicycle,Walked,Workathomeorothermethod

Model:

MultinomialLogisticRegressionisthelinearregressionanalysistoconductwhenthedependentvariableis

nominal with more than two levels. Thus it is an extension of logistic regression, which analyzes

dichotomous(binary)dependents.

P a g e 9|52

Also,inordertopreventtheoverfittingproblem,weconducted10-foldercrossvalidation.Cross-validation

isamodelvalidationtechniqueforassessinghowtheresultsofastatisticalanalysiswillgeneralizetoan

independentdataset.Itismainlyusedinsettingswherethegoalisprediction,andonewantstoestimate

howaccuratelyapredictivemodelwillperforminpractice.Inapredictionproblem,amodelisusuallygiven

adatasetofknowndataonwhichtrainingisrun(trainingdataset),andadatasetofunknowndata(orfirst

seendata)againstwhichthemodelistested(testingdataset).Thegoalofcrossvalidationistodefinea

datasetto"test"themodelinthetrainingphase(i.e.,thevalidationdataset),inordertolimitproblems,

likeoverfitting,giveaninsightonhowthemodelwillgeneralizetoanindependentdataset.

Code:

DetailsinAppendixVundersection“Codeusedtopredictmodeoftransport:

Pittsburgh_Mode_Choice.R”

Output:

PittsburghCity

U.S.NationalWide

Validation:

Duetothehighlyskewedoriginaldataset(overall95%ofthehouseholdsusecartocommute),theteam

hastorebalancethedataset.Wespecificallyextractedhalfofthedataascommuterswhousecarandthe

resthalfascommuterswhodon'tusecar.

ForPittsburghCity,theteamthenrandomlysubsetthedatasetintosamplesof20%,40%,60%and80%

and ran simulation 1000 times to compare the accuracy. However, for U.S. National wide, even if we

deployedAWSandlaunchedthelargestinstance(x10.Large)onit,westillcouldn'trunthewholenational

dataset.Soalternatively,werandomlysubsetthedatasetintosamplesof10%,20%,30%,40%and50%,

andransimulationonly1timestocomparetheaccuracy.

P a g e 10|52

Challenge:Largevolumeofdata

Conclusion:

ForPittsburghCitydata,theaverageaccuracyafter1000timessimulationrunsshowsthatpartial

samplingcanleadtoroughlylesseraccuracy.However,forU.S.NationalWidedata,wecannot

seeaclearpattern.Webelievethatthereasonbehindofthisphenomenonisthelackofsimulation

times.Giventhebudgetandtimerestriction,werealizedthatthereisalimitofcomputerpower.

Task2:ACSquestion“Howmanyminutesdiditusuallytakethispersontogetfromhome

toworklastweek?”

StepI:MeantraveltimeforPublictransportation-Businagivenblock

Summary:Usethearithmeticmeanformulatocalculatethemeantraveltime

Tool:ExcelandPython

Input:APCAVL(Automaticpassengercounterandautomaticvehiclelocation)data

Year Level Location

September2013 City Pittsburgh

DataCleaningStep:Eliminatemissingvaluesbyfilteringoutthoserecordsfromtheinput.

Inputvariables:

Ø PeoplegettingonthebusatastopØ PeoplegettingoffthebusatastopØ CurrentloadinthebusØ StopIDØ BlockIDorFIPScodeØ Timetakenbetweenthestops

Ø Dwelltimes

SelectionLogic:Majorvariableswereuseddirectlyorindirectlyforcalculatingmeantraveltime.Therewas

novariableselectionmethodsused.Thisisdirectcalculation.

PredictedVariable:meantraveltimeperpersoninagiventract(FIPScode)

Model:Formulaforcalculatingthemeantraveltimeisgivenbelow.

P a g e 11|52

MTT:Meantraveltimeofpeoplewhousepublictransportinaparticularblock

n:Numberofpeoplewhogetonthebus

t:Differenceintimetakenbetweengettingonthebusandgettingofthebus.

Code:Pythoncodefiles:CodeisinAppendix.

Ø getStopFIPS.py:thisistogettheFIPScodeofstopfromtheCensusAPI

Ø filter.py:Filteredmeantraveltimefromtractleveltocountylevel

Ø calculateMeanTT.py:Thisistocalculatethemeantime

Assumptions:Ø Locationatwhichapassengerboardsisassumedtobethatpassenger’shome/worklocationin

thatparticulartract.

Ø Thismaynecessarilynotbetrue.Itisimportanttorealizethatapersoncangetinandgetdownat

randomstopsaccording tohiswish.Though therearepatterns in thedata, there isnoway to

determinethesourceorthedestinationofapassengerfromthisdata.Thisaffectstheaverage

traveltimecritically.Thereisnodatatoidentifythepersoninquestionorforwhatpurposethe

persontravelsinthepublictransport.Meantraveltimecannotbecomputedatanindividuallevel.

Ø FIFOassumptionvisualized

Figure8section2

Output:

Meantraveltimevisualized

P a g e 12|52

Figure9Section2

Figure10Section2–ThevisualizationofmeantravelforbustransportatblocklevelforPittsburgh

P a g e 13|52

Challenge:

Ø ThebustransitdataonlyhasstopIDnumbersandstopnamewithnocoordinates.

Ø FindingthecoordinatesforthestopIDwasadifficulttask.TheteamsearchedseveralAPIs

includinggooglemapsandfinallyfoundthevaluesinPittsburghcityGISdata.

Ø JoinedthePittsburghcityGISdataandthebustransitdatausingstopIDasprimarykey

andobtainedthecoordinatesforthestop.

Ø Once the coordinateswereobtained, the coordinateswereused to get the FIPS code,

whichidentifiestheblocktowhichthestopbelongs.

Ø AllthebusstopswerethenassignedtotheirrespectiveblockusingFIPScode.

Ø ThestopsweregroupedaccordingtoFIPScode.

Ø Oncethisgroupingwasdone,averagetraveltimeforeachFIPScodewascalculatedusing

thepythoncode:calculateMeanTT.py

Conclusion:

Though,wemakeseveralassumptionstherelativemeantraveltimevaluesfordifferentblockscanstillgive

usagoodideaaboutthetrendsintraveltimeacrossblocks.

Forinstance,Meantraveltimeisstopsindowntownandquitelongforstopsinruralareas.Belowfigure

shows,thevisualizationRedindicatesmoretime.Blueindicateshighertime.

P a g e 1|52

StepII:PredictivemodelusinghistoricaldatafromACSalongwithinputsfromexternaldataset

Summary:Theteamtriedtoanswerthesequestionsbybuildingalinearregressionmodelbycombining

variablesfromACSdataandtheExternalDataSources.Themainmotivewastofindtheeffectofadding

ExternalDataSourcevariableontheACStraveltimeforalltractswithinPittsburgh.

Tool:RStudio,WEKA

Input:

DataSource TimePeriod Level Location #ofrecords #ofvariables

ACS 5year

estimates

CensusTract Pittsburgh 22 33

Pittsburgh

ParkingTerminals

2013and2014 City Pittsburgh 1

APCAVL 2013and2014 City Pittsburgh 1

DataCleaningStep:

Inputvariable:DetailsaboutcalculationofMeantraveltimeforpublictransportationismentionedinthe

previoussection.

Variablename Description(Allvariablesareintractlevel)

TRACTCE10 TRACTID

PublicMTT Pittsburghpublictransportmeantraveltimecalculatedfromexternal

dataset

ParkingTransactions NumberofparkingtransactioninPittsburgh

PopulationTotal Totalpopulation

IncomePerCapita Percapitaincome

MeansofTransportTotal Totalnumberofvehicles

Cartruckvan Numberofcars,trucksorvan

DroveAlone Numberofpeoplewhodrovealone

Carpooled NumberofPeoplewhocarpooled

Publictransport NumberofPeoplewhousedpublictransport

Bicycle NumberofPeoplewhousedbicycle

Walked Numberofpeoplewhowaked

Othermeans Numberofpeoplewhousedothermodesoftransport

Workedathome Numberofpeoplewhoworkedfromhome

Lessthan5minutes Numberofpeoplewhotooklessthan5minutestotraveltowork

5to9minutes Numberofpeoplewhotookbetween5to9minutestotraveltowork

10to14minutes Numberofpeoplewhotookbetween10to14minutestotravelto

work


work

P a g e 2|52


work


work


work


work


work


work


work

90ormoreminutes Numberofpeoplewhotookbetween90minutestotraveltowork

White Raceidentifier

AfricanAmerican Raceidentifier

Asian Raceidentifier

Other Raceidentifier

Male Numberofmales

Female Numberoffemales

AvgTT Averagetraveltime

Selection Logic: This is detailedunder the section “Factors affecting travel time inU.S” in “Appendix I:

LiteratureReview”.

Predictedvariable:Traveltimeattractlevel

Model:Linearregression

Code: Lm (formula = X$Avg_TT ~X$Public_MTT + X$Parking_Transactions + X$Population_Total +

X$Means_of_Transport_Total,data=X)

Output:

Belowistheresultoftheregressionmodelwithoutexcludingthehighlycorrelatedvariables.

P a g e 1|52

Figure11Section2

ItcanbeclearlyseenthatRsquareishigh.RsquareexplainsthevariabilityinthemodelandaHighRsquare

value(closeto1)impliestherearemanyvariablesthatareexplainingthevariationinthemodel.Pvalue

which denotes the significance is also lesser than 0.05. Therefore, we decided to exclude the highly

correlatedvariablesandruntheregressionmodelonlywiththeessentialvariables.Thetablebelowshows

thehighlycorrelatedvariablesthatwereexcluded.

P a g e 1|52

Figure12Section2

OutputofRegressionrunsafterexclusion:

Figure13Section2

GoodnessofFit:

Consideranylinearregressionmodel,whichlookslikethefollowing

Are all the assumptions - Normality of residuals or errors from the model, constant residual variance

throughouttherange–true?Itismostlyunlikely.Itisonlyplausiblethattheassumptionsarecloseenough.

Goodnessoffitdefineshowcloselytheassumptionsforthemodelareusefulinpractice.

Thereareseveralmeasuresforgoodnessoffittheyare:

• Examiningresidues

• AglobalmeasureofvarianceexplainedbyR2

• AglobalmeasureofvarianceadjustedfornumberofparametersinthemodelcalledadjustedR2

Residualscanbeuseddescriptively,bylookingateitherhistogramsorscatterplotsofresiduals.Consider

thefollowingmodel-

TheBCDresidualfortheBCDobservationisgivenby

WhereYiistheobserveddependentvariableandXijistheobservedcovariatesfortheBCDobservation.Nowletusplotthegraphandseetheoutcomes.

P a g e 2|52

Fit = lm(X$Avg_TT~

X$Public_MTT+X$Parking_Transactions+X$Population_Total+X$Means_of_Transport_Total)

>hist(fit$residuals,main="Histogramforresiduals",xlab="residuals")

>plot(X$Population_Total,fit$residuals)

>plot(X$Public_MTT,fit$residuals)

>plot(X$Parking_Transactions,fit$residuals)

>plot(X$Means_of_Transport_Total,fit$residuals)

P a g e 3|52

P a g e 4|52

Thehistogramisnotstronglynormalinshape,but,thenagain,therearejust22Observations.Scatterplots

lookreasonable,i.e.,noobviousdeparturesfromConstantvarianceorlinearity.

R2:Aglobalmeasureof“varianceexplained”

R2valueisrepeatedlyusedinregressiontoexplainthefit.Wewillnowseeexactlyhowtointerpretthis

measure.Considerfirstsimplelinearregression. Weareusually interestedinwhethertheindependent

variableisworthhaving,soreallywearecomparingthemodel

Y=α+βXtothesimplermodelY=α,usingtheinterceptonly.

Wedefine:

R2=1–{(sumofsquaredresidualsfrommodelwithαandβ/(sumofsquaredresidualsfrommodelwithαonly)}

Or

R2=1–{SS(res)/SS(total)}

whereSS(res)(oftenalsoreferredtoasSSEor“sumofsquaresoferror”)isdefinedasthesumofresidual

distancesfrommodelwithX,andSS(total)isthesame,butfromtheinterceptonlymodel.

Bydefinitionofleastsquaresregression,SS(res)≤SS(total),becauseifthebestregressionlinewasreally

usingαonly,thenSS(res)=SS(total),andinallothercases,addingβimprovesSS(res).

So,0≤R2

≤1.IfSS(res)=SS(total),thenR2

=0,andmodelisnotuseful.IfSS(res)=zero,thenR2=one,

andmodelfitsallpointsperfectly.Almostallmodelswillbebetweentheseextremes.

P a g e 5|52

Therefore,SS(res)showshowmuchcloserthepointsgettothelinewhenβisused,comparedtoaflatline

usingαonly(whichisalwaysY=α=Y).

Becauseofthis,wecancallR2the“proportionofvarianceexplainedbyaddingthevariableX”.

Essentiallythesamethinghappenswhenthereismorethanoneindependentvariable,exceptresidualsare

fromthemodelwithallXvariablesforthenumeratorinthedefinitionofR2.

Thus,R2

givesthe“proportion

ofvarianceexplainedbyaddingthevariablesX1,X2...Xp,iftherearepindependentvariablesinthemodel.

HowlargedoesR2

needtobetobeconsideredas“good”?Thisdependsonthecontext;thereisnoabsolute

answerhere.ForhardtopredictYvariables,smallervaluesmaybe“good”.Overall,R2providesauseful

measureofhowwellamodelfits,intermsof(squared)distancefrompointstothebestfittingline.However,

asoneaddsmoreregressioncoefficients,R2nevergoesdown,eveniftheadditionalXvariableisnotuseful.

Inotherwords,thereisnoadjustmentforthenumberofparametersinthemodel.

AdjustedR2

Inasimplelinearregression,pisthenumberofindependentvariables.

Itp=1thenAdjustedR2

=R2

.Asthenumberofparametersincreases,AdjustedR2

≤R2

.

Withthisdefinition:

R2=1–{(n−1)×sumofsquaredresidualsfrommodelwithαandβ/(n−p)×sumofsquaredresiduals

frommodelwithαonly}

Therefore, there issomeattempttoadjust for thenumberofparameters.Letusseethis forourmodel

below.

Call:

lm(formula=X$Avg_TT~X$Public_MTT+X$Parking_Transactions+

X$Population_Total+X$Means_of_Transport_Total)

Residuals:

Min1QMedian3QMax

-1.80948-0.614350.077580.522862.60368

Coefficients:

EstimateStd.ErrortvaluePr(>|t|)

(Intercept)9.301e+007.183e-0112.9503.11e-10***

X$Public_MTT7.227e-012.826e-012.5570.0204*

X$Parking_Transactions-4.102e-051.912e-05-2.1450.0467*

X$Population_Total-4.064e-033.614e-04-11.2452.70e-09***

X$Means_of_Transport_Total7.996e-036.955e-0411.4961.93e-09***

P a g e 6|52

---

Signif.codes:0‘***’0.001‘**’0.01‘*’0.05‘.’0.1‘’1

Residualstandarderror:1.155on17degreesoffreedom

MultipleR-squared:0.9006, AdjustedR-squared:0.8772

F-statistic:38.49on4and17DF,p-value:2.607e-08

Rreport:R2

=0.9006andAdjustedR2

=0.8772,so,ineithercase,about90%ofthetotalvarianceisexplained

bythevariablesused,whichisveryhigh.Atleastbythesemeasures,themodelfitswell.

Challenge:

Weinitiallycollectedthetractleveldataforthe33selectedvariablesandcombinedthemintoonefile.138

tractsfallunderAllegany-Pittsburghborders.

Oncethisdatawascollated,weusedtheTRACTIDandTRACTnameaskeytocombinetheExternalData

SourcevariablestotheACSdataset.

ParkingdataandAPCAVLdataarecollectedbydifferentauthorities.Theydonothaveacommonground

forcombining.Whenthedatasetsarebroughttoacommonlevelforexample,attractlevelitresultsin

missingdata.

ThemainreasonforthisisparkingdataisnotavailableforalltractsinPittsburghi.e.parkingspotsarenot

spreadacrossallareasinatractinPittsburgh.Similarly,buslinestravelacrossparticularroutesanddata

forsometractsarenotavailable.Whencombinedthisnaturallyresultsindatanotbeingavailableforsome

tracts.

Oncethedatasetswerecombineditresultedindataonlyfor22outof138Pittsburghtracts.

Thissamplesizeof22isextremelysmalltorunaregressionmodel.However,takingintoaccountthatthis

isonlyaproofofconceptwebuiltaregressionmodelusingR.

The regressionmodelwas initially runwith all the intuitively selected variables forACS alongwith the

externaldatasetvariables.

However,thismodelhadlargenumberofvariablesthatwerehighlycorrelatedandeffectsoftheexternal

datasetvariablescouldnotberealizedclearly.

Soweexcludedthehighlycorrelatedvariablesandranaregressionmodelonlywith4variables.2from

external data set. Publicmean travel time and number of parking transactions per tract and the ACS

variables.Totaltractpopulationandnumberofvehicles inatract.Theresultsarediscussedinthenext

page.

Conclusion:

Ø ThefirstRsquarevalueintheregressionresultsafterexcludingthesignificantvariables

is very high. Itmeans the existing variables in themodel explain the variability in the

P a g e 7|52

externaldatasetverywell.Thep-valueisalsolessthan0.05anditmeansthevariables

aresignificant.

Ø Modeoftransportforacommutercanbepredictedwithjustafractionofthecurrent

sampleavailabletoagooddegreeofaccuracy.

Ø Meantraveltimeforpeopleusingbusaspublictransportationhasbeencalculatedfor

blocklevel.Othercalculations,involvingdatafromexternaldatasource,includeparking

transactionsforall tractswithinPittsburghandaveragespeeddataandaveragetravel

time.

Ø UsingexternalandexistingACSdatameantravel timeforPittsburghtract level iswell

predicted. External data sets are not uniform. So, data needs to be cleaned and

normalized.Normalizationmayresultinaverysmallsample.Asaresult,thiscanonlybe

used as a proof of concept, further research is encouraged and results need to be

scrutinized.

Conclusion

Ø Mode of transport for a commuter can be predictedwell, evenwith a fraction of the

currentsample.

Ø MeantraveltimeatPittsburghtractleveliswellpredictedusingacombinationofAPCAVL

andPAparkingwiththeACSdata(87%ofthevariabilityexplained)

LessonsLearned

Data

Realworlddataisextremelydifferentandmultifaceted.Datasetsarecomplexaswellaslargein

volume. Useful information must be extracted from raw data. Even after extracting useful

informationmissing values and normalization issues can occur. Large data sets further create

problems when it comes to computing. They require huge processing power and distributed

computing.Laptopsandminicomputerscannothandlesuchdatahencetheyneedtobemoved

tothecloud.Thisresultsinextracosts.Sobetterandmoreefficientcodeneedstobedeveloped.

Assumptions

Whilebuildingmodelsusingdatamining severalassumptions likeFIFO,Queueareused these

assumptionscangivemodelsthatareverygoodforcreatingproofofconceptsbutnotentirely

accuratemodels.Forbuilding,moreaccuratemodels,oneneedstodeveloporusecustomized

machinelearningalgorithmsbasedonthedataset.

P a g e 8|52

SuggestionsforFutureWork

Travelmodemodel can be expanded to national level using the nationwide datawithmore

computingpoweranddistributedcomputing.

Expandlinearregressionmodeltopredicttraveltimefor

Ø AlltractsofPittsburgh.

Ø Statewideandnationwideusingotherexternaldatasets.

Travel mode model can be further refined by including the congestion and weather as

independentvariables.

Acknowledgement

ThiscapstoneprojectwasundertakenbytheteamforCentreforAdaptiveDesign,USCensusBureau,with

support fromCarnegieMellonUniversity.We thankour advisor Professor SeanQian for his continued

supportandguidance.Wealsothankourclientfortheirpatienceandsupport.

Thisprojectallowedsomeofustohoneouranalyticalskillswhileothersgainedanunderstandingofthe

analytics field.We thank our Heinz college, CarnegieMellon University, for providing us this learning

opportunity.

AppendixI:LiteratureReview

NationalHighwayTravelSurveyDataAnalysis

Source: The Impacts of Socio-Economic and Demographic Shifts in Transit Served Neighborhoods OnModeChoiceAndEquitybyStevenApell

P a g e 9|52

Highlighttheimpactsofsocio-economicanddemographicchangesonTOD.TODstandsforTransit-oriented

development. It is amixed-use residential and commercial areadesigned tomaximize access topublic

transport,andoften incorporates features toencourage transit ridership.Thepaperalsoevaluates the

overalleffectivenessofTODpolicyinsupportingasustainablecommunity.

Answeredfollowingfourquestions:

Ø Howhasthesocio-economicanddemographiccharacterofcommunitieschangedinblockgroups,

whicharewithin0.5mileand0.5-1.0mileofTODandnon-TODstations?

Ø IsthepresenceofTODinterrelatedwiththeoccurrenceofgentrificationinblockgroupswithin0.5

and0.5-1.0mileoftransitstations?

Ø Howhavechangesinsocio-economicanddemographicfactorsinfluencedmodechoiceinblock

groupswithin0.5and0.5-1.0mileradiusofTODandnon-TODstations?

Ø Whatarethedifferencesbetweensocio-demographiccharacteristicsofcommunitieslivinginTOD

comparedwith communities innon-TOD; further,whatdifferencesexistbetweencommunities

within0.5milesofTODcomparedwithcommunitieslivingwithin0.5-1.0mileradius?

Helpedidentifythefollowingfactorsthataffectthemodeoftransportchosenbyacommutertotraveltowork:

Ø Percentageofhouseholdswithnocarormorethanonecar

Ø Medianandpercapitaincome

Ø Percentageofcollegegraduates

Ø Medianhousingvalue

Ø Mediancontractrent

Ø Percentageofownerandrenteroccupiedhousing

Ø Racialcategories(PercentagesforBlackandWhite)

Ø Percentageofcollegegraduates

Ø EmploymentType

Ø LengthofResidency

FactorsaffectingtraveltimeinUS

Source:CommutinginAmericaReportIII-TheThirdNationalReportonCommutingPatternsandTrendsbyAlanE.Pisarski

Thisreportexaminescommutingpatternsintermsoflongstandingtrendsandemergingfactorsthataffect

commutingeveryday.Itexaminesthedatatrendidentifiedthroughtheresponsestoeachtravelquestion

ofACSindetail.

Traveltimeisafunctionofbothspeedanddistance.Theincreaseinworktripdistanceswhenmatchedby

the increase inworktravelspeeds indicatesanactual improvement inworktravelspeeds.Travel times

increasingfasterthanthetraveldistancesmightindicatetheeffectofcongestioninthetraveltime.Travel

distances growing faster than travel speeds might indicate that the commuter prefers staying in a

neighborhoodwithreducedhousingcostandthisneighborhoodmightbeataconsiderabledistancefrom

hisworklocation.

P a g e 10|52

Ithasalsobeenobservedinpastthatwomenmakefrequentstopswhiletravellingandassuch,theirtravel

timewillbemore.Areassmaller in sizewill tend tosendat least someportionofcommuters toother

metropolitanareasforwork.Familieswhohavetoddlerstravellessandalot.

Ithelpedustointuitivelyidentifythefactorsthatcouldaffecttravel-timeofacommuter.Thevariablesidentifiedwereasfollows–

Ø Age

Ø Gender

Ø MaritalStatus

Ø Presenceandageofownchildren

Ø Gavebirthtochildwithinpast12months

Ø Frequentresidenceshifters

Ø Citizenshipstatus-categorical

Ø Education

Ø Income

Ø Numberofvehiclesinthehousehold

Ø Monthlyrent

Ø Numberofbreadearnersinhousehold

BusTransitTime

Source:ImpactofTrafficCongestiononBusTravelTimeinNorthernNewJerseybyClaireE.McKnight,HerbertS.Levinson,KaanOzbay,CamilleKamga,andRobertE.Paaswell

Thispaperdetails the impactofcongestiononthebustravel timerate. Italsoprovidesdetailsabouta

regressionmodeldevelopedtoestimatetraveltimerate(inminutespermile)ofabusasafunctionofcar

traffictimerate,numberofpassengershoardingpermile,andthenumberofbusstopspermile.

Itdetailsthemethodologyinthefollowingway-

Ø Giveabriefoverviewofthepreviousstudiesconducted

Ø Describethedatacollectionandinitialanalysisofdata

Ø Definethemodel

Ø Statetheconclusions

PredictBustraveltimeforaparticularroute,particularbusandparticulartimeslot.

Aroutesegmentisdefinedasasectionofroutebetweentwoadjacenttimepoints,withatimepoint(TP)

beingthelocationatwhichtheschedulehadarecordedtime.

Busspeedvariationbetweenstopscanindicatecongestion.A.mpeak–7amto10am;Midday–10am

to4pm;P.mpeak–4pmto7pm;Postp.mpeak–after7pm.

Factors Characteristics

Bus Ø #ofstopsØ StopspacingØ Dwelltimesatstops

P a g e 11|52

Ø #ofpassengersboardingØ #ofpassengersalighting

Route Ø SegmentlengthØ #oftrafficsignalsØ #ofleftturnsinroute

Traffic Ø Cartraveltimerate=trafficlevelØ #ofvehiclesinqueuewaitingtomakeleftturnØ #oftaxismakingsuddenstopsorturnstopickup/dropoffpassengersØ #ofvehiclesontheroad

Parking Ø #ofdoubleortripleparkedcars

INRIXTrafficData

Source:INRIXInterfaceGuideDecember2014

ThisreportprovidesdetailedinformationabouttheINRIXdata.Italsoactsasaguidetohelpunderstand

howtomakeuseofINRIXdata.

Itcontainsthefollowinginformation-

Ø InterpretingTMCcodes

Ø UnderstandingXDsegments,sub-segmentsandINRIX-managedsetfiles

Ø Integratinggraphicaltrafficdatainuserapplications

Ø Generatingspeedbuckets

Thedatadictionary included in this reporthelpedus tounderstand the variables in the inputdata file

containingtheINRIXtrafficspeeddata.

UrbanCongestionLevel

Source:2015UrbanMobilityScorecard

It is published jointly by the Texas A&M Transportation Institute and INRIX. This report details the

methodologyusedtoestimatethecongestionlevelatanurbanarea-widelevel.Thisallowsforcomparison

ofcongestionlevelinaconsistentwayacrossurbanareas.

Itusesacombinationoftrafficvolumedataalongwiththetrafficspeeddatatocomputethecongestion

levelmeasures.Allthemeasuresandmanyoftheinputvariablesforeachyearandeverycityareprovided

inaspreadsheetthatcanbedownloadedathttp://mobility.tamu.edu/ums/congestion-data/.Theroadway

inventorydatasourceformostofthecalculationsistheHighwayPerformanceMonitoringSystemfromthe

FederalHighwayAdministration.TrafficvolumedataissourcedfromINRIX.

Thefollowingstepswereusedtocalculatethecongestionperformancemeasuresforeachurbanroadwaysection.

Ø ObtainHPMStrafficvolumedatabyroadsection

Ø MatchtheHPMSroadnetworksectionswiththeINRIXtrafficspeeddatasetroadsections

P a g e 12|52

Ø Estimatetrafficvolumesforeachhourtimeintervalfromthedailyvolumedata

Ø Calculateaveragetravelspeedandtotaldelayforeachhourinterval

Ø Establishfree-flow(i.e.,lowvolume)travelspeed

Ø Calculatecongestionperformancemeasures

Ø Additionalstepswhenvolumedatahadnospeeddatamatch

Themobilitymeasuresrequirefourdatainputs:

Ø Actualtravelspeed

Ø Free-flowtravelspeed

Ø Vehiclevolume

Ø Vehicleoccupancy(personspervehicle)tocalculateperson-hoursoftraveldelay

This paper could be referred in future to calculate the congestion factor to be included as another

independentvariableinthetravel-timepredictingmodel.

P a g e 13|52

AppendixII:InformationonExternalDataSources

NHTS:NationalHouseholdtravelsurvey

Data:NHTS2001and2009

NHTS isatravelsurveythatcollects informationaboutpeople,especiallyan individual’stravelbehavior

over a period through questionnaire. These surveys collect information about demographics of the

individual. Especially information about socio-economic status, household (size and structure), vehicle

data,travelmode,vehiclesowned,vehiclesused,purposeofthejourney,startingpointandendingpoint

ofthejourneyandnumberofpeoplewhothepersontravelledwith.

TheNationalHouseholdTravelSurvey(NHTS)providesinformationtoassisttransportationplannersand

policymakerswhoneedcomprehensivedataontravelandtransportationpatternsintheUnitedStates.

The 2009 NHTS updates information gathered in the 2001 NHTS and in prior Nationwide Personal

TransportationSurveys(NPTS)conductedin1969,1977,1983,1990,and1995.

Datacollected:

TheNHTS/NPTSactsasnation’sinventoryoftraveldata.Thedataiscollectedondailytripstakenina24-

hourperiod,andincludes:

Ø Purposeofthetrip(work,shopping,etc.)

Ø Meansoftransportationused(car,bus,subway,walk,etc.)

Ø Traveltime

Ø Timeofdaywhenthetriptookplace

Ø Dayofweekwhenthetriptookplace

Foraprivatevehicletrip,italsocaptures

Ø numberofpeopleinthevehicle,i.e.,vehicleoccupancy

Ø Drivercharacteristics(age,sex,workerstatus,educationlevel,etc.)

Ø Vehicleattributes(make,model,modelyear,amountofmilesdriveninayear)

Scope–WhattheNHTSIncludes

Ø Householddataonmembers,education,income

Ø Housingcharacteristics,informationoneachvehicle,includingyear,make,model,andestimates

ofannualmilestraveled

Ø Informationontravelaspartofwork;Dataaboutone-waytripstakenduringadesignated24-hour

period

Ø Informationtodescribecharacteristicsofgeographicalarea

Scope—WhatIsNotIncluded

Ø Costsoftravel

Ø Specifictravelroutesortypesofroadsused

P a g e 14|52

Ø Travelofthesampledhouseholdchangesovertime

Ø Informationthatwouldidentifytheexacthouseholdorworkplacelocation

APCAVL-Automaticpassengercounterandautomaticvehiclelocationdata

Location:Pittsburgh

Time:2012,2013,2014

Dataused:2013/2014

AVLAutomaticVehicleLocation(AVL)Systems,isapartofITS(intelligenttransportationsystem),

whichhasbeenadoptedbymanytransitagenciestotracktheirtransitvehiclesinrealtime.Itisessentially

avehicletrackingsystemthatcombinestheuseofautomaticvehiclelocationgatheredbyGPSattachedto

individualvehicleswithsoftwarethatcollectsafleetofdataforacomprehensivepicture.Thedatacollected

includesvehiclelocations,howlongtheytravel,wheretheystop,howlongittakestomovefromonestop

to another and the distance between the stops. Urban public transit authorities are an increasingly

commonuserofvehicletrackingsystems,particularlyinlargecitieslikePittsburgh.

AutomatedPassengerCounter (APC) on theotherhand is a device installedon transit vehicles,whichaccurately records boarding and alighting data. This device accurately tracks transit ridership when

compared to the traditionalmethods of accounting by drivers or estimation through surveying. These

devicesarebecomingmorecommonamongAmericantransitoperatorsseekingtoimprovetheaccuracy

ofreportingpatronageaswellasanalyzingtransitusepatternsbylinkingboardingandalightingdatawith

stoplocation.

AschematicdiagramforAVLAPCsystemisgivenbelow.

AUTOMATICVEHICLELOCATION

References:ntl.bts.gov

P a g e 15|52

AUTOMATICPASSENGERCOUNTER

References:Pinterest

Figure2Section1

AsampleofhowAPCAVLsystemcollectsdataisgivenbelowalongwiththevariabledescription.

P a g e 2|52

P a g e 1|52

Figure3section1

Variabledescription

Figure4section1

PittsburghParkingTerminals

Location:Pittsburgh

Time:2013and2014

Thedata(showninthesnapshotbelow)isforPittsburghdowntownparkingmeteroccupancycontainsall

DowntownPittsburghparkingterminalsidentifiedby"terminalID".Thedatafilegiventotheteamwasin

geoJSONformatandPPA_terminalscontainsparkingoccupancydataforentirePittsburgh

PittsburghparkingterminalsdatafileingeoJSONformat

Figure5section1

PittsburghParkingterminalconvertedtotableformat

P a g e 2|52

Figure6section1

INRIX

INRIXisaglobalSaaSandDaaScompany,whichprovidesavarietyofInternetservicesandmobile

applications pertaining to road traffic and driver services. INRIX provides historical, real-time traffic

information,trafficforecasts,traveltimes,traveltimepolygonsandtrafficcount.(References:Inrixwebsite

andWiki)

Wehadaccessto informationforPittsburghandall fields for INRIXdataforNorthbound,southbound,

eastbound,westbound,clockwiseandcounterclockwiseroadsinAlleghenycounty,PA(1930tmcs).The

variabledescriptionandsnapshotisprovidedinnextpage.

P a g e 3|52

SnapshotofINRIXdata

Figure7section1

P a g e 4|52

AppendixIII:DataCleaningSteps

TOOLSUSED:

Ø RStudio

Ø Weka

Ø Excel

APCAVL:Automaticpassengercounterandautomaticvehiclelocationdata

Availabledata:September2012,2013and2014.(RefertoFigure3and4inSection1).Thedatasetclearly

containsmanyvariables.However,lookingatitcloselytheteamrealizedthatitdoesnotcontainthemost

importantvariablesthatidentifythelocationofthebusstop,whichisthegeographiccoordinates.

Geographiccoordinatesactasthemissinglinkbetweenfindingthetractsthroughwhichthebus

transits. In order to find this data, we used the Pittsburgh GIS data from the Pittsburgh City website.

http://pittsburghpa.gov/dcp/gis/gis-data

Keyused:BusStopID

Thefollowingfileswereusedtoobtainthemissinginformationandcombinethedatasets

Ø CensusTracts2010:ContainsalldetailsaboutcensustractsinPittsburgh

Ø Neighborhoods:ContainsdetailsaboutallneighborhoodsinPittsburgh

Ø CensusBlocks2010:ContainsdetailsaboutallblocksandblockgroupsinPittsburgh

Ø Port Authority Bus Stops: Contains details about all bus stops and their respective stop IDs in

Pittsburgh

PittsburghDowntownparkingterminals

Available data: September 2013 and 2014. Refer to figure 6 section 1. Missing information about

neighborhoodandtracts.Datawasataverygranularlevel(TerminalIDlevelthatisageographicpoint).

Therefore,itneededtobescaleduptotractlevel.WeusedtheterminalIDscoordinateswhichactedas

themissinglinkbetweenfindingthetracts.Inordertofindthisdata,weusedthePittsburghGISdatafrom

thePittsburghcitywebsite.Keyusedtojointhedatasetiscoordinates.






P a g e 5|52

Afterobtainingthedata,weidentified138majortractsinPittsburgh,isolatedtheparkingdataforallthese

138tracts,andaggregatedthenumberofparkingtransactionsineachtract.Thisdatawasveryrelevantin

creatingtheregressionmodelthatwillbediscussedlater.

INRIX

Availabledataset:September2011and2013.Refer(figure7section1).Thisdatasetconsistsofvehicle

speedingdatafordifferentTMCroadsegmentsanditsrespectivecoordinates.Again,themissingdatais

tractFIPScode.Weusedthecoordinateswhichactedasthemissing linkbetweenfindingthetracts. In

OrdertofindthisdataweusedthePittsburghGISdatafromthePittsburghcitywebsite.Keyusedtojoinis

coordinates.






P a g e 6|52

AppendixIV:Models

LogisticRegression

Itisastatisticalmethodforanalyzingadatasetinwhichthereareoneormoreindependentvariablesthat

determineanoutcome.Theoutcomeismeasuredwithadichotomousvariable(inwhichthereareonly

twopossibleoutcomes).

AppliedinNHTSmodeltopredictmodeoftransporttakenbycommutertoreachworklocation.

MultinomialLogisticRegression

Itisthelinearregressionanalysistoconductwhenthedependentvariableisnominalwithmorethantwo

levels.Thus,itisanextensionoflogisticregression,whichanalyzesdichotomous(binary)dependents.

AppliedinACSmodeltopredictmodeoftransporttakenbycommutertoreachworklocation.

ArithmeticMean

Itissimply"mean")ofasample ,usuallydenotedby ,isthesumofthesampledvalues

dividedbythenumberofitemsinthesample:

SimpleLinearRegression

Itistheleastsquaresestimatorofalinearregressionmodelwithasingleexplanatoryvariable,fitsastraight

linethroughthesetofnpointsinsuchawaythatmakesthesumofsquaredresidualsofthemodel(that

is,verticaldistancesbetweenthepointsofthedatasetandthefittedline)assmallaspossible.

Theoutcomevariableisrelatedtoasinglepredictor.Theslopeofthefittedlineisequaltothecorrelation

betweenyandxcorrectedbytheratioofstandarddeviationsofthesevariables.Theinterceptofthefittedlineissuchthatitpassesthroughthecenterofmass(x,y)ofthedatapoints.

Appliedinmodelusedtopredicttimetakenbyacommutertoreachhisworklocation.

P a g e 7|52

AppendixV:Code

Codeusedtopredictmodeoftransport:Pittsburgh_Mode_Choice.R

(ApplicableforboththestepsofTask1)

library("nnet")

#Functionforimportcsvfile

import.csv<-function(filename){

return(read.csv(filename,sep=",",header=TRUE))

}

#Importdataset

pa<-import.csv('mergedPUMS2013.csv')

#ExtractonlyPittsburghCitydatabyPUMAcode

pit<-subset(pa,PUMA==01701|PUMA==01702)

#Createanemptydataframeforlateruse

final<-data.frame()

#RandomSampling

for(iin1:1000){

#Shuffle

pit<-pit[sample(nrow(pit)),]

#20%

#pit.20<-pit[1:189,]

#40%

#pit.40<-pit[1:377,]

#60%

#pit.60<-pit[1:566,]

#80%

#pit.80<-pit[1:754,]

#100%

P a g e 8|52

pit.100<-pit

#Combinetheresults

final<-rbind(final,results(pit.100))

}

#Replacemissingvalues

final[is.na(final)]<-0

#Calulatethemeanaccuracyasfinalresults

onethousandtimes.hundredpercent<-data.frame(Car=mean(final$Car),

Bus=mean(final$Bus),

StreetCar=mean(final$Streetcar),

Subway=mean(final$Subway),

Railroad=mean(final$Railroad),

Ferryboat=mean(final$Ferryboat),

Taxicab=mean(final$Taxicab),

Motorcycle=mean(final$Motorcycle),

Bicycle=mean(final$Bicycle),

Walked=mean(final$Walked),

Worked.at.home=mean(final$Worked.at.home),

Other.method=mean(final$Other.method)

)

#Functionforthemodel

results<-function(df){

#Variables

variables <-

c("JWMNP","FMRGIP","MIG","PINCP","COW","JWRIP","INTP","FLANXP","PWGTP","SPORDER","PERNP","HI

NS2","TYPE","FS","RACASN","RNTM","JWTR")

#Subsetthedatasetbyprovidedvariables

df<-df[variables]

#Dataprocessing

P a g e 9|52

df<-subset(df,JWTR!="")

df[is.na(df)]<-0

df$FMRGIP<-as.factor(df$FMRGIP)

df$MIG<-as.factor(df$MIG)

df$COW<-as.factor(df$COW)

df$JWRIP<-as.factor(df$JWRIP)

df$FLANXP<-as.factor(df$FLANXP)

df$HINS2<-as.factor(df$HINS2)

df$TYPE<-as.factor(df$TYPE)

df$FS<-as.factor(df$FS)

df$RACASN<-as.factor(df$RACASN)

df$RNTM<-as.factor(df$RNTM)

df$JWTR<-as.factor(df$JWTR)

private<-subset(df,JWTR=="1")

public<-subset(df,JWTR!="1")

#Functionfor10folderscrossvalidation

smoothCV<-function(x,y,K=10){

result<-data.frame()

x<-c(x)

y<-c(y)

df<-data.frame(x,y)

df<-df[sample(nrow(df)),]

folds=cut(1:nrow(df),breaks=K,labels=FALSE)

for(jin1:K){

train<-df[folds!=j,]

test<-df[folds==j,]

fit<-multinom(JWTR~.,data=train,MaxNWts=8268)

predicted2<-predict(fit,test,type="class")

P a g e 10|52

result<-rbind(result,data.frame(pred=predicted2,true=test$JWTR))

}

return(result)

}

#Runcrossvalidation

x<-df[,names(df)!="JWTR"]

y<-data.frame(df$JWTR)

colnames(y)<-c("JWTR")

result<-smoothCV(x,y,10)

#Createvariablesforlateruse

Car<-0

Bus<-0

Streetcar<-0

Subway<-0

Railroad<-0

Ferryboat<-0

Taxicab<-0

Motorcycle<-0

Bicycle<-0

Walked<-0

Worked.at.home<-0

Other.method<-0

#Accumulatetheoccurrencesofeachtransportationtool

for(iin1:nrow(result)){

if(result$true[i]=="1"&&result$pred[i]=="1"){

Car<-Car+1

}elseif(result$true[i]=="2"&&result$pred[i]=="2"){

Bus<-Bus+1


P a g e 11|52

Streetcar<-Streetcar+1


Subway<-Subway+1


Railroad<-Railroad+1


Ferryboat<-Ferryboat+1


Taxicab<-Taxicab+1


Motorcycle<-Motorcycle+1


Bicycle<-Bicycle+1


Walked<-Walked+1


Worked.at.home<-Worked.at.home+1


Other.method<-Other.method+1

}

}

#Calculatetheaccuracy

car<-subset(df,JWTR=="1")

car.accurary<-Car/nrow(car)

bus<-subset(df,JWTR=="2")

bus.accurary<-Bus/nrow(bus)

streetcar<-subset(df,JWTR=="3")

streetcar.accurary<-Streetcar/nrow(streetcar)

subway<-subset(df,JWTR=="4")

P a g e 12|52

subway.accurary<-Subway/nrow(subway)

railroad<-subset(df,JWTR=="5")

railroad.accurary<-Railroad/nrow(railroad)

ferryboat<-subset(df,JWTR=="6")

ferryboat.accurary<-Ferryboat/nrow(ferryboat)

taxicab<-subset(df,JWTR=="7")

taxicab.accurary<-Taxicab/nrow(taxicab)

motorcycle<-subset(df,JWTR=="8")

motorcycle.accurary<-Motorcycle/nrow(motorcycle)

bicycle<-subset(df,JWTR=="9")

bicycle.accurary<-Bicycle/nrow(bicycle)

walked<-subset(df,JWTR=="10")

walked.accurary<-Walked/nrow(walked)

worked.at.home<-subset(df,JWTR=="11")

worked.at.home.accurary<-Worked.at.home/nrow(worked.at.home)

other.method<-subset(df,JWTR=="12")

other.method.accurary<-Other.method/nrow(other.method)

accu<-data.frame(Car=car.accurary,

Bus=bus.accurary,

Streetcar=streetcar.accurary,

Subway=subway.accurary,

Railroad=railroad.accurary,

Ferryboat=ferryboat.accurary,

Taxicab=taxicab.accurary,

Motorcycle=motorcycle.accurary,

Bicycle=bicycle.accurary,

Walked=walked.accurary,

Worked.at.home=worked.at.home.accurary,

Other.method=other.method.accurary

P a g e 13|52

)

return(accu)

}

CodeusedtogettheFIPScodeofstopfromtheCensusAPI:getStopFIPS.py

importre

importrequests

frombs4importBeautifulSoup,Tag

importcsv

defmain():

a=1

map={}

withopen('stopIDFIPS','w',newline='')asfp:

a=csv.writer(fp,delimiter=',')

title=["stopID","FIPS"]

a.writerow(title)

withopen('PAAC_Stops_1603_Public2.csv','r')ascsvfile:

readCSV=csv.reader(csvfile,delimiter=',')

forrowinreadCSV:

ifa==1:

a=2

continue

all=row[4]+"+"+row[5]

value=map.get(all)

ifvalue==None:

url="http://data.fcc.gov/api/block/find?format=json&latitude="+row[4]+"&longitude="+row[5]+"

&showall=true"

html=requests.get(url).content

htmltxt=BeautifulSoup(html,'html.parser')

matchobj=re.findall(r'{"Block":{"FIPS":"(\d+)"},"County":',str(htmltxt))

iflen(matchobj)>0:

map[all]=matchobj[0]

value=matchobj[0]

ifvalue!=None:

data=[row[0],value]

a.writerow(data)

main()

P a g e 14|52

Codeusedtofiltermeantraveltimefromtractleveltocountylevel:filter.py

importre

importrequests

frombs4importBeautifulSoup,Tag

importcsv

defmain():

a=1

begin=set()

map={}

PUMAMap={}

withopen('filteredMeanTT.csv','w',newline='')asfp:


title=["PUMAtract","meanTT"]

a.writerow(title)

withopen('NeighborhoodPuma.csv','r')ascsvfile:


forrowinreadCSV:

ifa==1:

a=2

continue

puma=row[0]

tract=row[2][0:11]

begin.add(tract)

PUMAMap[tract]=puma

withopen('FIPSMeanTT.csv','r')asinput:

readInput=csv.reader(input,delimiter=',')

forrowinreadInput:

try:

FIPS=row[0]

time=float(row[1])

FIPSstart=FIPS[0:11]

except:

continue

ifFIPSstartinbegin:

pumaValue=PUMAMap.get(FIPSstart)

value=map.get(pumaValue)

ifvalue==None:

map[pumaValue]=[time,1]

else:

map[pumaValue]=[value[0]+time,value[1]+1]

forkey,valueinmap.items():

meanTT=value[0]/value[1]

data=[key,meanTT]

a.writerow(data)

P a g e 15|52

main()

Codeusedtocalculatethemeantime:calculateMeanTT.py

importre

importrequests

importcsv

fromosimportlistdir

fromos.pathimportisfile,join

fromcollectionsimportdeque

defmain():

countmiss=0

countsucc=0

FIPSMap={}

trackMap={}

d=deque()

countMissOff=0

withopen('FIPSMeanTT.csv','w',newline='')asfp:


title=["FIPS","meanTT"]

a.writerow(title)

withopen('stopIDFIPS.csv','r',newline='')ascsvfile1:

readCSV1=csv.reader(csvfile1,delimiter=',')

forrowinreadCSV1:

FIPSMap[row[0]]=row[1]

mypath='/Users/Yun/Documents/capstone/findTrackPublicMeanTT/StopInfo'

onlyfiles=[fforfinlistdir(mypath)ifisfile(join(mypath,f))]

forfileinonlyfiles:

iffile.endswith(".csv"):

withopen(mypath+"/"+file,'r')ascsvfile:


forrowinreadCSV:

iflen(row)>20:

ifrow[0]!='DOW':

try:

stopID=row[8].strip()

on=int(row[16])

off=int(row[17])

min=float(row[20])

exceptException:

continue

FIPS=FIPSMap.get(stopID)

P a g e 16|52

ifFIPS==None:

countmiss=countmiss+1

continue

else:

countsucc=countsucc+1

foriind:

trackInfo=trackMap.get(i)

iftrackInfo==None:

trackMap[i]=[min,0]

else:

iflen(trackInfo)==2:

trackMap[i]=[trackInfo[0]+min,trackInfo[1]]

trackInfoCur=trackMap.get(FIPS)

iftrackInfoCur==None:

trackMap[FIPS]=[0,0]

trackInfoCur=[0,0]

trackMap[FIPS]=[trackInfoCur[0],trackInfoCur[1]+on]

whileon>0:

d.append(FIPS)

on=on-1

whileoff>0:

iflen(d)==0:

countMissOff=countMissOff+off

break

else:

d.popleft()

off=off-1

whilecountMissOff>0:

iflen(d)>0:

d.popleft()

countMissOff=countMissOff-1

else:

break

forkey,valueintrackMap.items():

ifvalue[1]==0:

continue

else:

meanTT=value[0]/value[1]

print(key,value)

data=[key,meanTT]

a.writerow(data)

print(countmiss,countsucc)

main()

P a g e 17|52

Codeusedtopredictthetravel-time:travel_time.R

Lm (formula = X$Avg_TT ~X$Public_MTT + X$Parking_Transactions + X$Population_Total +

X$Means_of_Transport_Total,data=X)

P a g e 18|52

References

Ø NationalHouseholdTravelSurveyDataAnalysis:Ø U.S.CensusBureauprojectionsto2060methodology:

http://www.census.gov/population/projections/data/national/2014.htmlØ U.S.CensusBureauprojectionsto2060summarytables:

http://www.census.gov/population/projections/data/national/2014/summarytables.htmlØ SummaryreportabouttheU.S.CensusBureauprojectionsto2060:

http://www.census.gov/content/dam/Census/library/publications/2015/demo/p25-1143.pdfØ ACSLongitudinalEmployer-HouseholdDynamics:Job-to-JobFlows(J2J)Data(Beta)DataFiles:

http://lehd.ces.census.gov/data/j2j_beta.htmlØ ACSLongitudinalEmployer-HouseholdDynamics:Job-to-JobFlows(J2J)Data(Beta)Data

dictionary:http://lehd.ces.census.gov/data/schema/V4.1c-draft/lehd_public_use_schema.html

Ø MeasuringCommutingintheAmericanTimeUseSurvey:http://bae.uncg.edu/econ/files/2015/02/Kimbrough-working-paper-15-02.pdf

Ø OhioTIMES:transportationinformationmappingsystemhttp://gis.dot.state.oh.us/tims/Ø 2011OregonHouseholdActivitySurvey:

http://www.oregon.gov/ODOT/TD/TP/Pages/Data.aspxØ TheImpactsofSocio-EconomicandDemographicShiftsinTransitServedNeighborhoodsOn

ModeChoiceAndEquitybyStevenApellØ DemographicTrendsTheEffectsofDemographicChangeonSelectedTransportationServicesand

DemandbyMurdockSH,ClineME,ZeyM,PerezD,&JeantyPWØ ModeChoicesofMillennials:HowDifferent?HowEnduring?byRobertB.Case,PE,PhD,Seth

SchipinskiØ TravelTimeUseOverFiveDecadesbyChenSong&ChaoWeiØ NortheastTravelChoiceSurvey(NTCS)UniversityofVermont’sTransportationResearchCenter

(datanotpublic)Ø 2015UrbanMobilityScorecardpublishedjointlybytheTexasA&MTransportationInstituteand

INRIXØ INRIXInterfaceGuideDecember2014publishedbyINRIXØ ImpactofTrafficCongestiononBusTravelTimeinNorthernNewJerseybyClaireE.McKnight,

HerbertS.Levinson,KaanOzbay,CamilleKamga,andRobertE.PaaswellØ CommutinginAmericaReportIII-TheThirdNationalReportonCommutingPatternsandTrends

byAlanE.PisarskiØ RStudioTool:https://www.rstudio.com/products/rstudio/download/Ø WekaTool:https://weka.waikato.ac.nz/explorer Ø NationalHouseholdTravelSurveyData:Ø INRIXdata:Heinz College,Carnegie Mellon University Research CentreØ PittsburghDowntownParkingTerminals:Heinz College, Carnegie Mellon University

Research CentreØ Automaticpassengercounterandautomaticvehiclelocationdata:Heinz College, Carnegie

Mellon University Research Centre

P a g e 19|52

Ø PittsburghGISinformation:http://pittsburghpa.gov/dcp/gis/gis-dataØ DescriptionofStatisticalModels:Wikipedia and www.statisticssolutions.comØ GoodnessofFitexplanationforstatisticalmodel:

http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit