capstone project - u.s. census bureau, center for adaptive...

54
CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive Design Title of the Project: Combining Traditional and New Data Sources in the Production of Official Statistics Timing: Spring 2016 Client: The Center for Adaptive Design at the U.S. Census Bureau Issue Definition: The U.S. Census Bureau is the largest statistical agency in the U.S. Government. In addition to the once-a-decade population count, the Census Bureau collects data on the U.S. economy, healthcare, poverty, education, crime, and business innovation, among others. The predominant method used for collecting data by the Census Bureau is the traditional sample survey administered to households, businesses and government entities. Response rates to sample surveys, however, are declining rapidly while survey costs are climbing at an unsustainable rate. Further, the Census Bureau is highly interested in reducing public burden (the effort and time spent responding to surveys). This combination of challenges is driving the Census Bureau to explore alternative data collection methods, including passive approaches using Big Data techniques. The overall issue this project will help address is how to combine the data from new sources with data collected using the traditional approaches. The goal is to increase efficiency and timeliness, reduce cost and burden, and maintain data quality. Scope of Work: A particular line of survey questions that have proven to be problematic are those surrounding the topic of people’s travel and commuting habits. With the availability of traffic and commuting data from public and private sources, as well as crowd-sourced applications (such as Waze) the Census Bureau should be able to access data that may enable a reduction in travel and commuting questions in our sample surveys (such as the American Community Survey). This project would research possible data sources to infer selected travel and commuting statistics in the American Community Survey, suggest methods and tactics for accessing and interpreting those sources, and explore approaches for verifying the quality of these sources with the goal over time of incorporating them into official statistics. Possible data sources include roadway traffic counts, travel time, parking, public transportation, etc. Work also involves devising methods to test new data source quality against established survey data benchmarks, as well as suggesting a viable long-term plan for transitioning to new data sources. Expected Deliverables: Research and recommendations on possible new data sources to complement/replace targeted questions on existing sample surveys such as the American Community Survey

Upload: others

Post on 10-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

CAD11-30-15

CapstoneProject-U.S.CensusBureau,CenterforAdaptiveDesignTitleoftheProject:CombiningTraditionalandNewDataSourcesintheProductionofOfficialStatisticsTiming:Spring2016Client:TheCenterforAdaptiveDesignattheU.S.CensusBureauIssueDefinition:TheU.S.CensusBureauisthelargeststatisticalagencyintheU.S.Government.Inadditiontotheonce-a-decadepopulationcount,theCensusBureaucollectsdataontheU.S.economy,healthcare,poverty,education,crime,andbusinessinnovation,amongothers.ThepredominantmethodusedforcollectingdatabytheCensusBureauisthetraditionalsamplesurveyadministeredtohouseholds,businessesandgovernmententities.Responseratestosamplesurveys,however,aredecliningrapidlywhilesurveycostsareclimbingatanunsustainablerate.Further,theCensusBureauishighlyinterestedinreducingpublicburden(theeffortandtimespentrespondingtosurveys).ThiscombinationofchallengesisdrivingtheCensusBureautoexplorealternativedatacollectionmethods,includingpassiveapproachesusingBigDatatechniques.Theoverallissuethisprojectwillhelpaddressishowtocombinethedatafromnewsourceswithdatacollectedusingthetraditionalapproaches.Thegoalistoincreaseefficiencyandtimeliness,reducecostandburden,andmaintaindataquality.ScopeofWork:Aparticularlineofsurveyquestionsthathaveproventobeproblematicarethosesurroundingthetopicofpeople’stravelandcommutinghabits.Withtheavailabilityoftrafficandcommutingdatafrompublicandprivatesources,aswellascrowd-sourcedapplications(suchasWaze)theCensusBureaushouldbeabletoaccessdatathatmayenableareductionintravelandcommutingquestionsinoursamplesurveys(suchastheAmericanCommunitySurvey).ThisprojectwouldresearchpossibledatasourcestoinferselectedtravelandcommutingstatisticsintheAmericanCommunitySurvey,suggestmethodsandtacticsforaccessingandinterpretingthosesources,andexploreapproachesforverifyingthequalityofthesesourceswiththegoalovertimeofincorporatingthemintoofficialstatistics.Possibledatasourcesincluderoadwaytrafficcounts,traveltime,parking,publictransportation,etc.Workalsoinvolvesdevisingmethodstotestnewdatasourcequalityagainstestablishedsurveydatabenchmarks,aswellassuggestingaviablelong-termplanfortransitioningtonewdatasources.ExpectedDeliverables:

• Researchandrecommendationsonpossiblenewdatasourcesto

complement/replacetargetedquestionsonexistingsamplesurveyssuchastheAmericanCommunitySurvey

Page 2: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

CAD11-30-15

• Literaturereviewofexistingworkonincorporatingnewdatasourcesintotravelsurvey

• Documentationofmethods,technology,cost-benefits,andtacticsforaccessingandinterpretingnewdatasources

• Suggestedapproachesforverifyingthequalityofnewdatasources• Documentationofapossiblelong-termplanfortransitioningtonewdatasources

SkillsRequired:DataAnalytics/DataScienceSurveyMethodologyInformationTechnology/SystemDevelopmentAdvisoryBoard:MichaelThieme,ChiefoftheCenterforAdaptiveDesign,U.S.CensusBureauBenjaminReist,AssistantCenterChief,U.S.CensusBureau,CenterforAdaptiveDesign,U.S.CensusBureauStephanieCoffey,MathematicalStatistician,CenterforAdaptiveDesign,U.S.CensusBureauDeadline:12/5/2015FacultyAdvisor:SeanQianProposercontactinfo:Name,email,phone#Clientpointofcontact:[email protected](301)763-9062

Page 3: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

CapstoneReport

Spring2016

Advisor

SeanQian

Team

ManikandanPalani|NishaRao|SahilAggarwal|YifeiJiang|YunFu

Page 4: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 1|52

TableofContents

ExecutiveSummary......................................................................................................................................3

Introduction.................................................................................................................................................3

Background..............................................................................................................................................3

Goal..........................................................................................................................................................4

ProjectMethodology...................................................................................................................................4

Task1:ACSquestion“Howdidthispersonusuallygettoworklastweek?”...........................................5

Task2:ACSquestion“Howmanyminutesdiditusuallytakethispersontogetfromhometoworklastweek?”...................................................................................................................................................10

LessonsLearned...........................................................................................................................................7

SuggestionsforFutureWork........................................................................................................................8

Acknowledgement.......................................................................................................................................8

AppendixI:LiteratureReview......................................................................................................................8

NationalHighwayTravelSurveyDataAnalysis........................................................................................8

FactorsaffectingtraveltimeinUS...........................................................................................................9

BusTransitTime.....................................................................................................................................10

INRIXTrafficData...................................................................................................................................11

UrbanCongestionLevel.........................................................................................................................11

AppendixII:InformationonExternalDataSources...................................................................................13

NHTS:NationalHouseholdtravelsurvey...............................................................................................13

APCAVL-Automaticpassengercounterandautomaticvehiclelocationdata.....................................14

PittsburghParkingTerminals...................................................................................................................1

INRIX.........................................................................................................................................................2

AppendixIII:DataCleaningSteps.................................................................................................................4

APCAVL:Automaticpassengercounterandautomaticvehiclelocationdata........................................4

PittsburghDowntownparkingterminals.................................................................................................4

INRIX.........................................................................................................................................................5

AppendixIV:Models....................................................................................................................................6

LogisticRegression...................................................................................................................................6

MultinomialLogisticRegression...............................................................................................................6

ArithmeticMean......................................................................................................................................6

SimpleLinearRegression.........................................................................................................................6

Page 5: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 2|52

AppendixV:Code.........................................................................................................................................7

Codeusedtopredictmodeoftransport:Pittsburgh_Mode_Choice.R....................................................7

CodeusedtogettheFIPScodeofstopfromtheCensusAPI:getStopFIPS.py......................................13

Codeusedtofiltermeantraveltimefromtractleveltocountylevel:filter.py....................................14

Codeusedtocalculatethemeantime:calculateMeanTT.py................................................................15

Codeusedtopredictthetravel-time:travel_time.R.............................................................................17

References..................................................................................................................................................18

Page 6: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 3|52

ExecutiveSummary

This report is oneof the deliverables associatedwith theHeinz CapstoneProject programat Carnegie

MellonUniversity.Inthisprogram,studentsaregroupedinteamsoffiveandassignedaclientforwhom

theywillundertakeaninitiativeoverthecourseofasemester.

OurclientforthisprojectistheCentreforAdaptiveDesignoftheUnitedStatesCensusBureau.TheU.S

CensusBureauistheprincipleU.SFederalStatisticalsystemintheU.S.

OurgoalwastohelpreducethecostofconductingtheAmericanCommunitySurvey.Thisreportisdivided

intoseveralsectionsthatdetailthestepstakenbytheteamtodevelopasolutionthathelpsachievethe

goal.Firstly,itprovidesabriefbackgroundontheteam’objective.Secondly,itdetailsthemodelbuiltby

CMU Capstone team to estimate response to two of the fivework-commute related questions in the

American Community Survey (ACS) using historical ACS data and data from external sources – INRIX,

PittsburghParking.Theteamcomparedtheaccuracyoftheresultsofthemodelusinga20%,40%,60%

and80%samplingplan toobserve theutilityof themodel in scenarioswhereonly20%or40%of the

existingrespondentsweretobecontacted.ThemodelcouldhelptheBureautoreducethenumberof

peoplesurveyedyetobtainreliablyaccurateresponsesfortherestoftheintendedtargetpopulationand

subsequentlyreducesurveycosts.Thirdly,thereportdetailsthefutureworkofincludingfactorssuchas

congestionandweatherthatcouldhelpfurtherrefinethemodel.

Introduction

Background

US Census Bureau conducts the American Community Survey (ACS) on an ongoing basis to collect

information about the American people and their community. This data is used by planners and

entrepreneurstoassessthepastandplanthefuture.Italsohelpsdeterminethedistributionofmorethan

400$billioninfederalandstatefundseachyear.

Thesurveyisadministeredthroughseveralwaystooneinthirty-eightU.Shouseholdsperyear.Theform

can be completed online or on a paper form.U.S Census Bureau has a follow-up procedure in case a

respondentdoesnotrespondwithinastipulatedtimeframe.Thenon-respondenteitherreceivesacallor

apersonalvisitfromtheCensusstafftohelpcompletethesurvey.

Thecostofconductingthesurveyisincreasingwhilethesurveyresponserateisdeclining.ACScurrently

has seventy-seven questions. To reduce respondent burden, Centre for Adaptive Design of US Census

Bureauisresearchingwaystoreducesurveycost.

Ofitsseventy-sevenquestions,ACShasfivequestionsthatcaptureinformationaboutwork-commuteof

therespondents.Thesequestionshaveprovedtobeveryproblematicwhenitcomestoelicitingresponses.

Page 7: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 4|52

Goal

UndertheguidanceofSeanQian,CMUProfessor,ourgoalofthisprojectwastoaccesswaystoinferthe

travelandcommuterelatedquestionsinACS.

ProjectMethodology

Ofthefivework-commuterelatedquestionsinACS,theteamconcentratedontwoquestions-“Howdid

thispersonusuallygettoworklastweek?”and“Howmanyminutesdiditusuallytakethispersontoget

from home to work last week?”. The objective was to build models to estimate responses to these

questions.

So,theprojectmethodologyhasbeendetailedwithrespecttothetwotasks.

Task1:ACSquestion-“Howdidthispersonusuallygettoworklastweek?”

Summary: The team through its initial research (Appendix I: NHTS Data Analysis) identified the

methodologytodevelopamodeltopredictthemodeoftransportaspublicorprivate.Theexplanatory

variablesforthemodelwereinitiallyidentifiedfrombothNHTSandACSdatasources.

NHTSdataisdetailedandcananswermanyquestionsdirectlybutitslatestdataisfromyear2009.

NHTSisnotconductedasfrequentlyasACS.Duetothisfact,theutilityofthemodelwaslimited.More

detailsareprovidedin"StepI:PredictivemodelusingNationalHouseholdTravelSurveydata"

TheteamthenidentifiedequivalentofNHTSvariablesinACSdata.Theteammembersdeveloped

amodeltopredictthemodeoftransportaspublicorprivateusingsolelythevariablesfromACSdata.They

refinedthemodelfurthertobeabletoidentifythetypeofpublicorprivatemodeoftransporttakenbya

respondent.Theutilityofthismodelwastestedusing20%,40%,60%and80%randomsamplingplan.More

detailsareprovidedin"StepII:PredictivemodelusingPUMSdataofAmericanCommunitySurvey"

Task2:ACSquestion-“Howmanyminutesdiditusuallytakethispersontogetfromhometoworklast

week?”

Summary:UsingtheCommutinginAmericaIIIreport(AppendixII:FactorsaffectingtraveltimeinUS),the

team identified intuitively the variables that could affect the work-commute travel time. The team

membersalsoresearched(AppendixIII:BusTransitTimeandAppendixIV:INRIXTrafficData)onhowto

makeuseoftheexternaldatasourcessuchasBustransitdataandINRIXtrafficspeeddatainthemodelto

predicttravel-time.

The team faced challengeswith respect to data granularity and overcame themby using data

cleaningmethodsandaggregatingthedataattheappropriatelevel.Theteamalsohadtomakequiteafew

assumptionsinordertomakeuseoftheexternaldatasets.Themembersdevelopedamethodtocalculate

meantraveltimeforbustransitatblocklevel.Moredetailsareprovidedin"StepI:Meantraveltimefor

Publictransportation-Businagivenblock".

Thedetailsof the travel-timepredictingmodel areprovided in "Step II: Predictivemodelusing

historicaldatafromACSalongwithinputsfromexternaldataset".

Page 8: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 5|52

Theteamalsodidsomeresearchoncalculatingthecongestionlevelinurbanareas(AppendixV:

Urban Congestion Level). As per the “Commuting in America III report”, travel time is an attribute of

commuters whereas congestion is an attribute of facilities. Therefore, measures of travel times and

measuresofcongestiondonotnecessarilyconverge.Hencethecongestionfactorhasnotbeendelvedinto

detailandcurrentlynotbeenincorporatedinthetravel-timemodel.

Thedetailsofeachstepineachtaskaredescribedbelow.

Task1:ACSquestion“Howdidthispersonusuallygettoworklastweek?”

ResponseoptionsinACS:

• Car,truck,orvan

• Busortrolleybus

• Streetcarortrolleycar

• Subwayorelevated

• Railroad

• Ferryboat

• Taxicab

• Motorcycle

• Bicycle

• Walked

• Workedathome

• Othermethod

StepI:PredictivemodelusingNationalHouseholdTravelSurveydata

Summary:Thismodelwasbuilttoinitiallyclassifythepopulationintermsofwhetheranindividualuseda

publicoraprivatemodeoftransporttocommutetowork,andoncethatmodelshowedpromise,todig

deeperintothepredictionbybeingabletoalsoidentifywhichparticularmodeoftransportwasused(for

e.g.car,bicycleetc.inprivatevehiclesandbus,trainetc.inpublicmodesoftransport).

Tool:RStudio

InputData:

Source:NationalHouseholdTravelSurvey

Details:

Year Level #ofvariables #ofrecords

2001 National 142 70k

2009 National 150 75k

Data Cleaning Step: Some of the input variables were non-binary categorical variables. The logistic

regression model cannot accept non-binary categorical variables. These variables were converted to

relevantdummyvariablesforuseinthepredictiveprocess.Fore.g.Thevariablefor‘Race’cantakeupto9

differentvalues.Soatleast8dummyvariableswereneededforeachvaluethatthevariablecouldassume.

Aftertakingthedummyvariablesintoaccount,wehadatotalofaround160variables.Thevariablesthat

didnotrealisticallyholdanypromisewereweededouteitherduetotheir inherentdefinitionordueto

their infrequency in the actual data. After removing the variables that had almost zero frequency of

occurrence,theshortlistedvariablestotaledinthe60-80range.

Page 9: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 6|52

ExplanatoryVariables:

Ø Age

Ø Sex

Ø Race

Ø Region

Ø Urban housing

status

Ø Block housing

status

Ø MSAsize

Ø Educationlevel

Ø Hispanicindicatorpopulation

density

Ø Totalincome

Ø Householdincome

Ø Renterpercentage

Ø Workerstatus

Ø Numberofvehicles

Ø Homeownership

status

Ø Placeofbirth

Selectionlogic:DetailsareprovidedinAppendixI:NHTSDataAnalysis

PredictedVariable:Modeoftransport–PublicorPrivate

Model:GLMbasedlogisticclassifier

ln #$1 − #$

= )* + ),-,,$ + )/-/,$ + ⋯+)1-1,$

whereln #$1 − #$

= logoddsofprivatemodeoftransportused

-,,$ =predictorvariablei

Code: “Details in Appendix V under section “Code used to predict mode of transport:

Pittsburgh_Mode_Choice.R”

Output:

Ø 93%predictionaccuracyfor2001NHTSdataat0.5cutoffasadecisionrule

Ø 95%predictionaccuracyfor2009NHTSdataat0.5cutoffasadecisionrule

Validation:10-foldcrossvalidationmethodwasusedtovalidatetheresults.Ithadanaverageaccuracyrate

of93%.

Challenge:

Ø Highlyskewednatureofdata-92%peopleuseprivatemodeoftransport

Ø Trade-offbetweenaccuracyandthenumberoffalsepositivesfortweakingthedecisionruleand

increasingitbeyond0.5

Theoverall rateofpublic transportuse in theNHTSsamplewasaround7%whichwasroughlyour

model’smisclassificationrate.Thishappenedduetotheinherentlyskewednatureofthedata,where~93%

ofthepopulationusedaprivateformoftransporttocommutetoworkwhereasonly7%usedapublic

modeoftransport.Furthermore,ourmodel’sROCcurveshowed26%greaterrecallthanaflipping-of-coin

basedmodel.

Conclusion: The most important and most correlated variables (with the mode of transport used to

commutetowork)areAge,Sex,EducationlevelandTotalIncome-allofwhichmadeperfectintuitivesense.

Page 10: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 7|52

TheNational Household Travel Survey data can be used to estimate cannot be used as an alternative

source.

StepII:PredictivemodelusingPUMSdataofAmericanCommunitySurvey

Summaryofthemodel:Thismodelwasbuilttoclassifythepopulationintermsoftheparticularmodeof

transport(fore.g.car,bicycle,train,etc.)usedtocommutetowork.

Tool:RStudio,AmazonWebService

InputData:AmericanCommunitySurvey’shistoricaldata

TheAmericanCommunitySurvey (ACS)PublicUseMicrodataSample (PUMS) filesarea setof

untabulatedrecordsaboutindividualpeopleorhousingunits.TheCensusBureauproducesthe

PUMSfilessothatdatauserscancreatecustomtablesthatarenotavailablethroughpretabulated

(orsummary)ACSdataproducts.

DataCleaningStep:

Ø TheteamextractedthePittsburghCitydatafromthewholenationalwidedatasetbasedonthe

PUMAcodeofPittsburghCity(01701and01702).

Ø Forthemissingvalues,wereplacedthemwith0or01accordingtocertaincircumstances.

ExplanatoryVariables:

Thereare512variablesinthedataset,theteamthenselectedthevariablesthatmostinfluence

whichtransportmodethepersonismostlikelytouseforbothPittsburghCityandU.S.national

wide.

VariablesforPittsburghCity:

Name Explanation

JWMNP Traveltimetowork

FMRGIP Firstmortgagepaymentincludesfire,hazard,floodinsuranceallocationflag

MIG Mobilitystatus

PINCP Totalperson'sincome

COW Classofworker

JWRIP Vehicleoccupancy

INTP Interest,dividends,andnetrentalincomepast12months

FLANXP LanguageotherthanEnglishallocationflag

PWGTP Person'sweight

SPORDER PersonNumber

PERNP Totalperson'searnings

HINS2 Insurancepurchaseddirectlyfromaninsurancecompany

Page 11: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 8|52

TYPE Typeofunit

FS Yearlyfoodstamprecipiency

RACASN Asianrecode(Asianaloneorincombinationwithoneormoreotherraces)

RNTM Mealsincludedinrent

JWTR(Target) Meansoftransportationtowork

VariablesforU.S.NationalWide:

Name Explanation

JWMNP Traveltimetowork

POWSP Placeofwork-Stateorforeigncountryrecode

HINS7 IndianHealthService

Wgtp HousingWeightreplicate3

REFR Refrigerator

RACBLK BlackorAfricanAmericanrecode(Blackaloneorincombinationwithoneormoreother

races)

MHP Mobilehomecosts(yearlyamount)

SSP SocialSecurityincomepast12months

PUMA Publicusemicrodataareacode(PUMA)

OCCP Occupationrecode

POWPUMA PlaceofworkPUMA

FINDP Industryallocationflag

SMOCP Selectedmonthlyownercosts

HINCP Householdincome(past12months)

OCPIP Selectedmonthlyownercostsasapercentageofhouseholdincomeduringthepast12

months

JWTR Meansoftransportationtowork

SelectionLogic:Sameastheonementionedintheprevioussection.

PredictedVariable:Modeoftransport–Car,Bus,Streetcar,Subway,Railroad,Ferryboat,Taxicab,

Motorcycle,Bicycle,Walked,Workathomeorothermethod

Model:

MultinomialLogisticRegressionisthelinearregressionanalysistoconductwhenthedependentvariableis

nominal with more than two levels. Thus it is an extension of logistic regression, which analyzes

dichotomous(binary)dependents.

Page 12: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 9|52

Also,inordertopreventtheoverfittingproblem,weconducted10-foldercrossvalidation.Cross-validation

isamodelvalidationtechniqueforassessinghowtheresultsofastatisticalanalysiswillgeneralizetoan

independentdataset.Itismainlyusedinsettingswherethegoalisprediction,andonewantstoestimate

howaccuratelyapredictivemodelwillperforminpractice.Inapredictionproblem,amodelisusuallygiven

adatasetofknowndataonwhichtrainingisrun(trainingdataset),andadatasetofunknowndata(orfirst

seendata)againstwhichthemodelistested(testingdataset).Thegoalofcrossvalidationistodefinea

datasetto"test"themodelinthetrainingphase(i.e.,thevalidationdataset),inordertolimitproblems,

likeoverfitting,giveaninsightonhowthemodelwillgeneralizetoanindependentdataset.

Code:

DetailsinAppendixVundersection“Codeusedtopredictmodeoftransport:

Pittsburgh_Mode_Choice.R”

Output:

PittsburghCity

U.S.NationalWide

Validation:

Duetothehighlyskewedoriginaldataset(overall95%ofthehouseholdsusecartocommute),theteam

hastorebalancethedataset.Wespecificallyextractedhalfofthedataascommuterswhousecarandthe

resthalfascommuterswhodon'tusecar.

ForPittsburghCity,theteamthenrandomlysubsetthedatasetintosamplesof20%,40%,60%and80%

and ran simulation 1000 times to compare the accuracy. However, for U.S. National wide, even if we

deployedAWSandlaunchedthelargestinstance(x10.Large)onit,westillcouldn'trunthewholenational

dataset.Soalternatively,werandomlysubsetthedatasetintosamplesof10%,20%,30%,40%and50%,

andransimulationonly1timestocomparetheaccuracy.

Page 13: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 10|52

Challenge:Largevolumeofdata

Conclusion:

ForPittsburghCitydata,theaverageaccuracyafter1000timessimulationrunsshowsthatpartial

samplingcanleadtoroughlylesseraccuracy.However,forU.S.NationalWidedata,wecannot

seeaclearpattern.Webelievethatthereasonbehindofthisphenomenonisthelackofsimulation

times.Giventhebudgetandtimerestriction,werealizedthatthereisalimitofcomputerpower.

Task2:ACSquestion“Howmanyminutesdiditusuallytakethispersontogetfromhome

toworklastweek?”

StepI:MeantraveltimeforPublictransportation-Businagivenblock

Summary:Usethearithmeticmeanformulatocalculatethemeantraveltime

Tool:ExcelandPython

Input:APCAVL(Automaticpassengercounterandautomaticvehiclelocation)data

Year Level Location

September2013 City Pittsburgh

DataCleaningStep:Eliminatemissingvaluesbyfilteringoutthoserecordsfromtheinput.

Inputvariables:

Ø PeoplegettingonthebusatastopØ PeoplegettingoffthebusatastopØ CurrentloadinthebusØ StopIDØ BlockIDorFIPScodeØ Timetakenbetweenthestops

Ø Dwelltimes

SelectionLogic:Majorvariableswereuseddirectlyorindirectlyforcalculatingmeantraveltime.Therewas

novariableselectionmethodsused.Thisisdirectcalculation.

PredictedVariable:meantraveltimeperpersoninagiventract(FIPScode)

Model:Formulaforcalculatingthemeantraveltimeisgivenbelow.

Page 14: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 11|52

MTT:Meantraveltimeofpeoplewhousepublictransportinaparticularblock

n:Numberofpeoplewhogetonthebus

t:Differenceintimetakenbetweengettingonthebusandgettingofthebus.

Code:Pythoncodefiles:CodeisinAppendix.

Ø getStopFIPS.py:thisistogettheFIPScodeofstopfromtheCensusAPI

Ø filter.py:Filteredmeantraveltimefromtractleveltocountylevel

Ø calculateMeanTT.py:Thisistocalculatethemeantime

Assumptions:Ø Locationatwhichapassengerboardsisassumedtobethatpassenger’shome/worklocationin

thatparticulartract.

Ø Thismaynecessarilynotbetrue.Itisimportanttorealizethatapersoncangetinandgetdownat

randomstopsaccording tohiswish.Though therearepatterns in thedata, there isnoway to

determinethesourceorthedestinationofapassengerfromthisdata.Thisaffectstheaverage

traveltimecritically.Thereisnodatatoidentifythepersoninquestionorforwhatpurposethe

persontravelsinthepublictransport.Meantraveltimecannotbecomputedatanindividuallevel.

Ø FIFOassumptionvisualized

Figure8section2

Output:

Meantraveltimevisualized

Page 15: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 12|52

Figure9Section2

Figure10Section2–ThevisualizationofmeantravelforbustransportatblocklevelforPittsburgh

Page 16: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 13|52

Challenge:

Ø ThebustransitdataonlyhasstopIDnumbersandstopnamewithnocoordinates.

Ø FindingthecoordinatesforthestopIDwasadifficulttask.TheteamsearchedseveralAPIs

includinggooglemapsandfinallyfoundthevaluesinPittsburghcityGISdata.

Ø JoinedthePittsburghcityGISdataandthebustransitdatausingstopIDasprimarykey

andobtainedthecoordinatesforthestop.

Ø Once the coordinateswereobtained, the coordinateswereused to get the FIPS code,

whichidentifiestheblocktowhichthestopbelongs.

Ø AllthebusstopswerethenassignedtotheirrespectiveblockusingFIPScode.

Ø ThestopsweregroupedaccordingtoFIPScode.

Ø Oncethisgroupingwasdone,averagetraveltimeforeachFIPScodewascalculatedusing

thepythoncode:calculateMeanTT.py

Conclusion:

Though,wemakeseveralassumptionstherelativemeantraveltimevaluesfordifferentblockscanstillgive

usagoodideaaboutthetrendsintraveltimeacrossblocks.

Forinstance,Meantraveltimeisstopsindowntownandquitelongforstopsinruralareas.Belowfigure

shows,thevisualizationRedindicatesmoretime.Blueindicateshighertime.

Page 17: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 1|52

StepII:PredictivemodelusinghistoricaldatafromACSalongwithinputsfromexternaldataset

Summary:Theteamtriedtoanswerthesequestionsbybuildingalinearregressionmodelbycombining

variablesfromACSdataandtheExternalDataSources.Themainmotivewastofindtheeffectofadding

ExternalDataSourcevariableontheACStraveltimeforalltractswithinPittsburgh.

Tool:RStudio,WEKA

Input:

DataSource TimePeriod Level Location #ofrecords #ofvariables

ACS 5year

estimates

CensusTract Pittsburgh 22 33

Pittsburgh

ParkingTerminals

2013and2014 City Pittsburgh 1

APCAVL 2013and2014 City Pittsburgh 1

DataCleaningStep:

Inputvariable:DetailsaboutcalculationofMeantraveltimeforpublictransportationismentionedinthe

previoussection.

Variablename Description(Allvariablesareintractlevel)

TRACTCE10 TRACTID

PublicMTT Pittsburghpublictransportmeantraveltimecalculatedfromexternal

dataset

ParkingTransactions NumberofparkingtransactioninPittsburgh

PopulationTotal Totalpopulation

IncomePerCapita Percapitaincome

MeansofTransportTotal Totalnumberofvehicles

Cartruckvan Numberofcars,trucksorvan

DroveAlone Numberofpeoplewhodrovealone

Carpooled NumberofPeoplewhocarpooled

Publictransport NumberofPeoplewhousedpublictransport

Bicycle NumberofPeoplewhousedbicycle

Walked Numberofpeoplewhowaked

Othermeans Numberofpeoplewhousedothermodesoftransport

Workedathome Numberofpeoplewhoworkedfromhome

Lessthan5minutes Numberofpeoplewhotooklessthan5minutestotraveltowork

5to9minutes Numberofpeoplewhotookbetween5to9minutestotraveltowork

10to14minutes Numberofpeoplewhotookbetween10to14minutestotravelto

work

15to19minutes Numberofpeoplewhotookbetween15to19minutestotravelto

work

Page 18: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 2|52

20to24minutes Numberofpeoplewhotookbetween20to24minutestotravelto

work

25to29minutes Numberofpeoplewhotookbetween25to29minutestotravelto

work

30to34minutes Numberofpeoplewhotookbetween30to34minutestotravelto

work

35to39minutes Numberofpeoplewhotookbetween35to39minutestotravelto

work

40to44minutes Numberofpeoplewhotookbetween40to44minutestotravelto

work

45to59minutes Numberofpeoplewhotookbetween45to49minutestotravelto

work

60to89minutes Numberofpeoplewhotookbetween60to89minutestotravelto

work

90ormoreminutes Numberofpeoplewhotookbetween90minutestotraveltowork

White Raceidentifier

AfricanAmerican Raceidentifier

Asian Raceidentifier

Other Raceidentifier

Male Numberofmales

Female Numberoffemales

AvgTT Averagetraveltime

Selection Logic: This is detailedunder the section “Factors affecting travel time inU.S” in “Appendix I:

LiteratureReview”.

Predictedvariable:Traveltimeattractlevel

Model:Linearregression

Code: Lm (formula = X$Avg_TT ~X$Public_MTT + X$Parking_Transactions + X$Population_Total +

X$Means_of_Transport_Total,data=X)

Output:

Belowistheresultoftheregressionmodelwithoutexcludingthehighlycorrelatedvariables.

Page 19: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 1|52

Figure11Section2

ItcanbeclearlyseenthatRsquareishigh.RsquareexplainsthevariabilityinthemodelandaHighRsquare

value(closeto1)impliestherearemanyvariablesthatareexplainingthevariationinthemodel.Pvalue

which denotes the significance is also lesser than 0.05. Therefore, we decided to exclude the highly

correlatedvariablesandruntheregressionmodelonlywiththeessentialvariables.Thetablebelowshows

thehighlycorrelatedvariablesthatwereexcluded.

Page 20: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 1|52

Figure12Section2

OutputofRegressionrunsafterexclusion:

Figure13Section2

GoodnessofFit:

Consideranylinearregressionmodel,whichlookslikethefollowing

Are all the assumptions - Normality of residuals or errors from the model, constant residual variance

throughouttherange–true?Itismostlyunlikely.Itisonlyplausiblethattheassumptionsarecloseenough.

Goodnessoffitdefineshowcloselytheassumptionsforthemodelareusefulinpractice.

Thereareseveralmeasuresforgoodnessoffittheyare:

• Examiningresidues

• AglobalmeasureofvarianceexplainedbyR2

• AglobalmeasureofvarianceadjustedfornumberofparametersinthemodelcalledadjustedR2

Residualscanbeuseddescriptively,bylookingateitherhistogramsorscatterplotsofresiduals.Consider

thefollowingmodel-

TheBCDresidualfortheBCDobservationisgivenby

WhereYiistheobserveddependentvariableandXijistheobservedcovariatesfortheBCDobservation.Nowletusplotthegraphandseetheoutcomes.

Page 21: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 2|52

Fit = lm(X$Avg_TT~

X$Public_MTT+X$Parking_Transactions+X$Population_Total+X$Means_of_Transport_Total)

>hist(fit$residuals,main="Histogramforresiduals",xlab="residuals")

>plot(X$Population_Total,fit$residuals)

>plot(X$Public_MTT,fit$residuals)

>plot(X$Parking_Transactions,fit$residuals)

>plot(X$Means_of_Transport_Total,fit$residuals)

Page 22: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 3|52

Page 23: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 4|52

Thehistogramisnotstronglynormalinshape,but,thenagain,therearejust22Observations.Scatterplots

lookreasonable,i.e.,noobviousdeparturesfromConstantvarianceorlinearity.

R2:Aglobalmeasureof“varianceexplained”

R2valueisrepeatedlyusedinregressiontoexplainthefit.Wewillnowseeexactlyhowtointerpretthis

measure.Considerfirstsimplelinearregression. Weareusually interestedinwhethertheindependent

variableisworthhaving,soreallywearecomparingthemodel

Y=α+βXtothesimplermodelY=α,usingtheinterceptonly.

Wedefine:

R2=1–{(sumofsquaredresidualsfrommodelwithαandβ/(sumofsquaredresidualsfrommodelwithαonly)}

Or

R2=1–{SS(res)/SS(total)}

whereSS(res)(oftenalsoreferredtoasSSEor“sumofsquaresoferror”)isdefinedasthesumofresidual

distancesfrommodelwithX,andSS(total)isthesame,butfromtheinterceptonlymodel.

Bydefinitionofleastsquaresregression,SS(res)≤SS(total),becauseifthebestregressionlinewasreally

usingαonly,thenSS(res)=SS(total),andinallothercases,addingβimprovesSS(res).

So,0≤R2

≤1.IfSS(res)=SS(total),thenR2

=0,andmodelisnotuseful.IfSS(res)=zero,thenR2=one,

andmodelfitsallpointsperfectly.Almostallmodelswillbebetweentheseextremes.

Page 24: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 5|52

Therefore,SS(res)showshowmuchcloserthepointsgettothelinewhenβisused,comparedtoaflatline

usingαonly(whichisalwaysY=α=Y).

Becauseofthis,wecancallR2the“proportionofvarianceexplainedbyaddingthevariableX”.

Essentiallythesamethinghappenswhenthereismorethanoneindependentvariable,exceptresidualsare

fromthemodelwithallXvariablesforthenumeratorinthedefinitionofR2.

Thus,R2

givesthe“proportion

ofvarianceexplainedbyaddingthevariablesX1,X2...Xp,iftherearepindependentvariablesinthemodel.

HowlargedoesR2

needtobetobeconsideredas“good”?Thisdependsonthecontext;thereisnoabsolute

answerhere.ForhardtopredictYvariables,smallervaluesmaybe“good”.Overall,R2providesauseful

measureofhowwellamodelfits,intermsof(squared)distancefrompointstothebestfittingline.However,

asoneaddsmoreregressioncoefficients,R2nevergoesdown,eveniftheadditionalXvariableisnotuseful.

Inotherwords,thereisnoadjustmentforthenumberofparametersinthemodel.

AdjustedR2

Inasimplelinearregression,pisthenumberofindependentvariables.

Itp=1thenAdjustedR2

=R2

.Asthenumberofparametersincreases,AdjustedR2

≤R2

.

Withthisdefinition:

R2=1–{(n−1)×sumofsquaredresidualsfrommodelwithαandβ/(n−p)×sumofsquaredresiduals

frommodelwithαonly}

Therefore, there issomeattempttoadjust for thenumberofparameters.Letusseethis forourmodel

below.

Call:

lm(formula=X$Avg_TT~X$Public_MTT+X$Parking_Transactions+

X$Population_Total+X$Means_of_Transport_Total)

Residuals:

Min1QMedian3QMax

-1.80948-0.614350.077580.522862.60368

Coefficients:

EstimateStd.ErrortvaluePr(>|t|)

(Intercept)9.301e+007.183e-0112.9503.11e-10***

X$Public_MTT7.227e-012.826e-012.5570.0204*

X$Parking_Transactions-4.102e-051.912e-05-2.1450.0467*

X$Population_Total-4.064e-033.614e-04-11.2452.70e-09***

X$Means_of_Transport_Total7.996e-036.955e-0411.4961.93e-09***

Page 25: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 6|52

---

Signif.codes:0‘***’0.001‘**’0.01‘*’0.05‘.’0.1‘’1

Residualstandarderror:1.155on17degreesoffreedom

MultipleR-squared:0.9006, AdjustedR-squared:0.8772

F-statistic:38.49on4and17DF,p-value:2.607e-08

Rreport:R2

=0.9006andAdjustedR2

=0.8772,so,ineithercase,about90%ofthetotalvarianceisexplained

bythevariablesused,whichisveryhigh.Atleastbythesemeasures,themodelfitswell.

Challenge:

Weinitiallycollectedthetractleveldataforthe33selectedvariablesandcombinedthemintoonefile.138

tractsfallunderAllegany-Pittsburghborders.

Oncethisdatawascollated,weusedtheTRACTIDandTRACTnameaskeytocombinetheExternalData

SourcevariablestotheACSdataset.

ParkingdataandAPCAVLdataarecollectedbydifferentauthorities.Theydonothaveacommonground

forcombining.Whenthedatasetsarebroughttoacommonlevelforexample,attractlevelitresultsin

missingdata.

ThemainreasonforthisisparkingdataisnotavailableforalltractsinPittsburghi.e.parkingspotsarenot

spreadacrossallareasinatractinPittsburgh.Similarly,buslinestravelacrossparticularroutesanddata

forsometractsarenotavailable.Whencombinedthisnaturallyresultsindatanotbeingavailableforsome

tracts.

Oncethedatasetswerecombineditresultedindataonlyfor22outof138Pittsburghtracts.

Thissamplesizeof22isextremelysmalltorunaregressionmodel.However,takingintoaccountthatthis

isonlyaproofofconceptwebuiltaregressionmodelusingR.

The regressionmodelwas initially runwith all the intuitively selected variables forACS alongwith the

externaldatasetvariables.

However,thismodelhadlargenumberofvariablesthatwerehighlycorrelatedandeffectsoftheexternal

datasetvariablescouldnotberealizedclearly.

Soweexcludedthehighlycorrelatedvariablesandranaregressionmodelonlywith4variables.2from

external data set. Publicmean travel time and number of parking transactions per tract and the ACS

variables.Totaltractpopulationandnumberofvehicles inatract.Theresultsarediscussedinthenext

page.

Conclusion:

Ø ThefirstRsquarevalueintheregressionresultsafterexcludingthesignificantvariables

is very high. Itmeans the existing variables in themodel explain the variability in the

Page 26: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 7|52

externaldatasetverywell.Thep-valueisalsolessthan0.05anditmeansthevariables

aresignificant.

Ø Modeoftransportforacommutercanbepredictedwithjustafractionofthecurrent

sampleavailabletoagooddegreeofaccuracy.

Ø Meantraveltimeforpeopleusingbusaspublictransportationhasbeencalculatedfor

blocklevel.Othercalculations,involvingdatafromexternaldatasource,includeparking

transactionsforall tractswithinPittsburghandaveragespeeddataandaveragetravel

time.

Ø UsingexternalandexistingACSdatameantravel timeforPittsburghtract level iswell

predicted. External data sets are not uniform. So, data needs to be cleaned and

normalized.Normalizationmayresultinaverysmallsample.Asaresult,thiscanonlybe

used as a proof of concept, further research is encouraged and results need to be

scrutinized.

Conclusion

Ø Mode of transport for a commuter can be predictedwell, evenwith a fraction of the

currentsample.

Ø MeantraveltimeatPittsburghtractleveliswellpredictedusingacombinationofAPCAVL

andPAparkingwiththeACSdata(87%ofthevariabilityexplained)

LessonsLearned

Data

Realworlddataisextremelydifferentandmultifaceted.Datasetsarecomplexaswellaslargein

volume. Useful information must be extracted from raw data. Even after extracting useful

informationmissing values and normalization issues can occur. Large data sets further create

problems when it comes to computing. They require huge processing power and distributed

computing.Laptopsandminicomputerscannothandlesuchdatahencetheyneedtobemoved

tothecloud.Thisresultsinextracosts.Sobetterandmoreefficientcodeneedstobedeveloped.

Assumptions

Whilebuildingmodelsusingdatamining severalassumptions likeFIFO,Queueareused these

assumptionscangivemodelsthatareverygoodforcreatingproofofconceptsbutnotentirely

accuratemodels.Forbuilding,moreaccuratemodels,oneneedstodeveloporusecustomized

machinelearningalgorithmsbasedonthedataset.

Page 27: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 8|52

SuggestionsforFutureWork

Travelmodemodel can be expanded to national level using the nationwide datawithmore

computingpoweranddistributedcomputing.

Expandlinearregressionmodeltopredicttraveltimefor

Ø AlltractsofPittsburgh.

Ø Statewideandnationwideusingotherexternaldatasets.

Travel mode model can be further refined by including the congestion and weather as

independentvariables.

Acknowledgement

ThiscapstoneprojectwasundertakenbytheteamforCentreforAdaptiveDesign,USCensusBureau,with

support fromCarnegieMellonUniversity.We thankour advisor Professor SeanQian for his continued

supportandguidance.Wealsothankourclientfortheirpatienceandsupport.

Thisprojectallowedsomeofustohoneouranalyticalskillswhileothersgainedanunderstandingofthe

analytics field.We thank our Heinz college, CarnegieMellon University, for providing us this learning

opportunity.

AppendixI:LiteratureReview

NationalHighwayTravelSurveyDataAnalysis

Source: The Impacts of Socio-Economic and Demographic Shifts in Transit Served Neighborhoods OnModeChoiceAndEquitybyStevenApell

Page 28: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 9|52

Highlighttheimpactsofsocio-economicanddemographicchangesonTOD.TODstandsforTransit-oriented

development. It is amixed-use residential and commercial areadesigned tomaximize access topublic

transport,andoften incorporates features toencourage transit ridership.Thepaperalsoevaluates the

overalleffectivenessofTODpolicyinsupportingasustainablecommunity.

Answeredfollowingfourquestions:

Ø Howhasthesocio-economicanddemographiccharacterofcommunitieschangedinblockgroups,

whicharewithin0.5mileand0.5-1.0mileofTODandnon-TODstations?

Ø IsthepresenceofTODinterrelatedwiththeoccurrenceofgentrificationinblockgroupswithin0.5

and0.5-1.0mileoftransitstations?

Ø Howhavechangesinsocio-economicanddemographicfactorsinfluencedmodechoiceinblock

groupswithin0.5and0.5-1.0mileradiusofTODandnon-TODstations?

Ø Whatarethedifferencesbetweensocio-demographiccharacteristicsofcommunitieslivinginTOD

comparedwith communities innon-TOD; further,whatdifferencesexistbetweencommunities

within0.5milesofTODcomparedwithcommunitieslivingwithin0.5-1.0mileradius?

Helpedidentifythefollowingfactorsthataffectthemodeoftransportchosenbyacommutertotraveltowork:

Ø Percentageofhouseholdswithnocarormorethanonecar

Ø Medianandpercapitaincome

Ø Percentageofcollegegraduates

Ø Medianhousingvalue

Ø Mediancontractrent

Ø Percentageofownerandrenteroccupiedhousing

Ø Racialcategories(PercentagesforBlackandWhite)

Ø Percentageofcollegegraduates

Ø EmploymentType

Ø LengthofResidency

FactorsaffectingtraveltimeinUS

Source:CommutinginAmericaReportIII-TheThirdNationalReportonCommutingPatternsandTrendsbyAlanE.Pisarski

Thisreportexaminescommutingpatternsintermsoflongstandingtrendsandemergingfactorsthataffect

commutingeveryday.Itexaminesthedatatrendidentifiedthroughtheresponsestoeachtravelquestion

ofACSindetail.

Traveltimeisafunctionofbothspeedanddistance.Theincreaseinworktripdistanceswhenmatchedby

the increase inworktravelspeeds indicatesanactual improvement inworktravelspeeds.Travel times

increasingfasterthanthetraveldistancesmightindicatetheeffectofcongestioninthetraveltime.Travel

distances growing faster than travel speeds might indicate that the commuter prefers staying in a

neighborhoodwithreducedhousingcostandthisneighborhoodmightbeataconsiderabledistancefrom

hisworklocation.

Page 29: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 10|52

Ithasalsobeenobservedinpastthatwomenmakefrequentstopswhiletravellingandassuch,theirtravel

timewillbemore.Areassmaller in sizewill tend tosendat least someportionofcommuters toother

metropolitanareasforwork.Familieswhohavetoddlerstravellessandalot.

Ithelpedustointuitivelyidentifythefactorsthatcouldaffecttravel-timeofacommuter.Thevariablesidentifiedwereasfollows–

Ø Age

Ø Gender

Ø MaritalStatus

Ø Presenceandageofownchildren

Ø Gavebirthtochildwithinpast12months

Ø Frequentresidenceshifters

Ø Citizenshipstatus-categorical

Ø Education

Ø Income

Ø Numberofvehiclesinthehousehold

Ø Monthlyrent

Ø Numberofbreadearnersinhousehold

BusTransitTime

Source:ImpactofTrafficCongestiononBusTravelTimeinNorthernNewJerseybyClaireE.McKnight,HerbertS.Levinson,KaanOzbay,CamilleKamga,andRobertE.Paaswell

Thispaperdetails the impactofcongestiononthebustravel timerate. Italsoprovidesdetailsabouta

regressionmodeldevelopedtoestimatetraveltimerate(inminutespermile)ofabusasafunctionofcar

traffictimerate,numberofpassengershoardingpermile,andthenumberofbusstopspermile.

Itdetailsthemethodologyinthefollowingway-

Ø Giveabriefoverviewofthepreviousstudiesconducted

Ø Describethedatacollectionandinitialanalysisofdata

Ø Definethemodel

Ø Statetheconclusions

PredictBustraveltimeforaparticularroute,particularbusandparticulartimeslot.

Aroutesegmentisdefinedasasectionofroutebetweentwoadjacenttimepoints,withatimepoint(TP)

beingthelocationatwhichtheschedulehadarecordedtime.

Busspeedvariationbetweenstopscanindicatecongestion.A.mpeak–7amto10am;Midday–10am

to4pm;P.mpeak–4pmto7pm;Postp.mpeak–after7pm.

Factors Characteristics

Bus Ø #ofstopsØ StopspacingØ Dwelltimesatstops

Page 30: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 11|52

Ø #ofpassengersboardingØ #ofpassengersalighting

Route Ø SegmentlengthØ #oftrafficsignalsØ #ofleftturnsinroute

Traffic Ø Cartraveltimerate=trafficlevelØ #ofvehiclesinqueuewaitingtomakeleftturnØ #oftaxismakingsuddenstopsorturnstopickup/dropoffpassengersØ #ofvehiclesontheroad

Parking Ø #ofdoubleortripleparkedcars

INRIXTrafficData

Source:INRIXInterfaceGuideDecember2014

ThisreportprovidesdetailedinformationabouttheINRIXdata.Italsoactsasaguidetohelpunderstand

howtomakeuseofINRIXdata.

Itcontainsthefollowinginformation-

Ø InterpretingTMCcodes

Ø UnderstandingXDsegments,sub-segmentsandINRIX-managedsetfiles

Ø Integratinggraphicaltrafficdatainuserapplications

Ø Generatingspeedbuckets

Thedatadictionary included in this reporthelpedus tounderstand the variables in the inputdata file

containingtheINRIXtrafficspeeddata.

UrbanCongestionLevel

Source:2015UrbanMobilityScorecard

It is published jointly by the Texas A&M Transportation Institute and INRIX. This report details the

methodologyusedtoestimatethecongestionlevelatanurbanarea-widelevel.Thisallowsforcomparison

ofcongestionlevelinaconsistentwayacrossurbanareas.

Itusesacombinationoftrafficvolumedataalongwiththetrafficspeeddatatocomputethecongestion

levelmeasures.Allthemeasuresandmanyoftheinputvariablesforeachyearandeverycityareprovided

inaspreadsheetthatcanbedownloadedathttp://mobility.tamu.edu/ums/congestion-data/.Theroadway

inventorydatasourceformostofthecalculationsistheHighwayPerformanceMonitoringSystemfromthe

FederalHighwayAdministration.TrafficvolumedataissourcedfromINRIX.

Thefollowingstepswereusedtocalculatethecongestionperformancemeasuresforeachurbanroadwaysection.

Ø ObtainHPMStrafficvolumedatabyroadsection

Ø MatchtheHPMSroadnetworksectionswiththeINRIXtrafficspeeddatasetroadsections

Page 31: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 12|52

Ø Estimatetrafficvolumesforeachhourtimeintervalfromthedailyvolumedata

Ø Calculateaveragetravelspeedandtotaldelayforeachhourinterval

Ø Establishfree-flow(i.e.,lowvolume)travelspeed

Ø Calculatecongestionperformancemeasures

Ø Additionalstepswhenvolumedatahadnospeeddatamatch

Themobilitymeasuresrequirefourdatainputs:

Ø Actualtravelspeed

Ø Free-flowtravelspeed

Ø Vehiclevolume

Ø Vehicleoccupancy(personspervehicle)tocalculateperson-hoursoftraveldelay

This paper could be referred in future to calculate the congestion factor to be included as another

independentvariableinthetravel-timepredictingmodel.

Page 32: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 13|52

AppendixII:InformationonExternalDataSources

NHTS:NationalHouseholdtravelsurvey

Data:NHTS2001and2009

NHTS isatravelsurveythatcollects informationaboutpeople,especiallyan individual’stravelbehavior

over a period through questionnaire. These surveys collect information about demographics of the

individual. Especially information about socio-economic status, household (size and structure), vehicle

data,travelmode,vehiclesowned,vehiclesused,purposeofthejourney,startingpointandendingpoint

ofthejourneyandnumberofpeoplewhothepersontravelledwith.

TheNationalHouseholdTravelSurvey(NHTS)providesinformationtoassisttransportationplannersand

policymakerswhoneedcomprehensivedataontravelandtransportationpatternsintheUnitedStates.

The 2009 NHTS updates information gathered in the 2001 NHTS and in prior Nationwide Personal

TransportationSurveys(NPTS)conductedin1969,1977,1983,1990,and1995.

Datacollected:

TheNHTS/NPTSactsasnation’sinventoryoftraveldata.Thedataiscollectedondailytripstakenina24-

hourperiod,andincludes:

Ø Purposeofthetrip(work,shopping,etc.)

Ø Meansoftransportationused(car,bus,subway,walk,etc.)

Ø Traveltime

Ø Timeofdaywhenthetriptookplace

Ø Dayofweekwhenthetriptookplace

Foraprivatevehicletrip,italsocaptures

Ø numberofpeopleinthevehicle,i.e.,vehicleoccupancy

Ø Drivercharacteristics(age,sex,workerstatus,educationlevel,etc.)

Ø Vehicleattributes(make,model,modelyear,amountofmilesdriveninayear)

Scope–WhattheNHTSIncludes

Ø Householddataonmembers,education,income

Ø Housingcharacteristics,informationoneachvehicle,includingyear,make,model,andestimates

ofannualmilestraveled

Ø Informationontravelaspartofwork;Dataaboutone-waytripstakenduringadesignated24-hour

period

Ø Informationtodescribecharacteristicsofgeographicalarea

Scope—WhatIsNotIncluded

Ø Costsoftravel

Ø Specifictravelroutesortypesofroadsused

Page 33: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 14|52

Ø Travelofthesampledhouseholdchangesovertime

Ø Informationthatwouldidentifytheexacthouseholdorworkplacelocation

APCAVL-Automaticpassengercounterandautomaticvehiclelocationdata

Location:Pittsburgh

Time:2012,2013,2014

Dataused:2013/2014

AVLAutomaticVehicleLocation(AVL)Systems,isapartofITS(intelligenttransportationsystem),

whichhasbeenadoptedbymanytransitagenciestotracktheirtransitvehiclesinrealtime.Itisessentially

avehicletrackingsystemthatcombinestheuseofautomaticvehiclelocationgatheredbyGPSattachedto

individualvehicleswithsoftwarethatcollectsafleetofdataforacomprehensivepicture.Thedatacollected

includesvehiclelocations,howlongtheytravel,wheretheystop,howlongittakestomovefromonestop

to another and the distance between the stops. Urban public transit authorities are an increasingly

commonuserofvehicletrackingsystems,particularlyinlargecitieslikePittsburgh.

AutomatedPassengerCounter (APC) on theotherhand is a device installedon transit vehicles,whichaccurately records boarding and alighting data. This device accurately tracks transit ridership when

compared to the traditionalmethods of accounting by drivers or estimation through surveying. These

devicesarebecomingmorecommonamongAmericantransitoperatorsseekingtoimprovetheaccuracy

ofreportingpatronageaswellasanalyzingtransitusepatternsbylinkingboardingandalightingdatawith

stoplocation.

AschematicdiagramforAVLAPCsystemisgivenbelow.

AUTOMATICVEHICLELOCATION

References:ntl.bts.gov

Page 34: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 15|52

AUTOMATICPASSENGERCOUNTER

References:Pinterest

Figure2Section1

AsampleofhowAPCAVLsystemcollectsdataisgivenbelowalongwiththevariabledescription.

Page 35: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 2|52

Page 36: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 1|52

Figure3section1

Variabledescription

Figure4section1

PittsburghParkingTerminals

Location:Pittsburgh

Time:2013and2014

Thedata(showninthesnapshotbelow)isforPittsburghdowntownparkingmeteroccupancycontainsall

DowntownPittsburghparkingterminalsidentifiedby"terminalID".Thedatafilegiventotheteamwasin

geoJSONformatandPPA_terminalscontainsparkingoccupancydataforentirePittsburgh

PittsburghparkingterminalsdatafileingeoJSONformat

Figure5section1

PittsburghParkingterminalconvertedtotableformat

Page 37: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 2|52

Figure6section1

INRIX

INRIXisaglobalSaaSandDaaScompany,whichprovidesavarietyofInternetservicesandmobile

applications pertaining to road traffic and driver services. INRIX provides historical, real-time traffic

information,trafficforecasts,traveltimes,traveltimepolygonsandtrafficcount.(References:Inrixwebsite

andWiki)

Wehadaccessto informationforPittsburghandall fields for INRIXdataforNorthbound,southbound,

eastbound,westbound,clockwiseandcounterclockwiseroadsinAlleghenycounty,PA(1930tmcs).The

variabledescriptionandsnapshotisprovidedinnextpage.

Page 38: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 3|52

SnapshotofINRIXdata

Figure7section1

Page 39: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 4|52

AppendixIII:DataCleaningSteps

TOOLSUSED:

Ø RStudio

Ø Weka

Ø Excel

APCAVL:Automaticpassengercounterandautomaticvehiclelocationdata

Availabledata:September2012,2013and2014.(RefertoFigure3and4inSection1).Thedatasetclearly

containsmanyvariables.However,lookingatitcloselytheteamrealizedthatitdoesnotcontainthemost

importantvariablesthatidentifythelocationofthebusstop,whichisthegeographiccoordinates.

Geographiccoordinatesactasthemissinglinkbetweenfindingthetractsthroughwhichthebus

transits. In order to find this data, we used the Pittsburgh GIS data from the Pittsburgh City website.

http://pittsburghpa.gov/dcp/gis/gis-data

Keyused:BusStopID

Thefollowingfileswereusedtoobtainthemissinginformationandcombinethedatasets

Ø CensusTracts2010:ContainsalldetailsaboutcensustractsinPittsburgh

Ø Neighborhoods:ContainsdetailsaboutallneighborhoodsinPittsburgh

Ø CensusBlocks2010:ContainsdetailsaboutallblocksandblockgroupsinPittsburgh

Ø Port Authority Bus Stops: Contains details about all bus stops and their respective stop IDs in

Pittsburgh

PittsburghDowntownparkingterminals

Available data: September 2013 and 2014. Refer to figure 6 section 1. Missing information about

neighborhoodandtracts.Datawasataverygranularlevel(TerminalIDlevelthatisageographicpoint).

Therefore,itneededtobescaleduptotractlevel.WeusedtheterminalIDscoordinateswhichactedas

themissinglinkbetweenfindingthetracts.Inordertofindthisdata,weusedthePittsburghGISdatafrom

thePittsburghcitywebsite.Keyusedtojointhedatasetiscoordinates.

http://pittsburghpa.gov/dcp/gis/gis-data

Thefollowingfileswereusedtoobtainthemissinginformationandcombinethedatasets

Ø CensusTracts2010:ContainsalldetailsaboutcensustractsinPittsburgh

Ø Neighborhoods:ContainsdetailsaboutallneighborhoodsinPittsburgh

Ø CensusBlocks2010:ContainsdetailsaboutallblocksandblockgroupsinPittsburgh

Page 40: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 5|52

Afterobtainingthedata,weidentified138majortractsinPittsburgh,isolatedtheparkingdataforallthese

138tracts,andaggregatedthenumberofparkingtransactionsineachtract.Thisdatawasveryrelevantin

creatingtheregressionmodelthatwillbediscussedlater.

INRIX

Availabledataset:September2011and2013.Refer(figure7section1).Thisdatasetconsistsofvehicle

speedingdatafordifferentTMCroadsegmentsanditsrespectivecoordinates.Again,themissingdatais

tractFIPScode.Weusedthecoordinateswhichactedasthemissing linkbetweenfindingthetracts. In

OrdertofindthisdataweusedthePittsburghGISdatafromthePittsburghcitywebsite.Keyusedtojoinis

coordinates.

http://pittsburghpa.gov/dcp/gis/gis-data

Thefollowingfileswereusedtoobtainthemissinginformationandcombinethedatasets

Ø CensusTracts2010:ContainsalldetailsaboutcensustractsinPittsburgh

Ø Neighborhoods:ContainsdetailsaboutallneighborhoodsinPittsburgh

Ø CensusBlocks2010:ContainsdetailsaboutallblocksandblockgroupsinPittsburgh

Page 41: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 6|52

AppendixIV:Models

LogisticRegression

Itisastatisticalmethodforanalyzingadatasetinwhichthereareoneormoreindependentvariablesthat

determineanoutcome.Theoutcomeismeasuredwithadichotomousvariable(inwhichthereareonly

twopossibleoutcomes).

AppliedinNHTSmodeltopredictmodeoftransporttakenbycommutertoreachworklocation.

MultinomialLogisticRegression

Itisthelinearregressionanalysistoconductwhenthedependentvariableisnominalwithmorethantwo

levels.Thus,itisanextensionoflogisticregression,whichanalyzesdichotomous(binary)dependents.

AppliedinACSmodeltopredictmodeoftransporttakenbycommutertoreachworklocation.

ArithmeticMean

Itissimply"mean")ofasample ,usuallydenotedby ,isthesumofthesampledvalues

dividedbythenumberofitemsinthesample:

SimpleLinearRegression

Itistheleastsquaresestimatorofalinearregressionmodelwithasingleexplanatoryvariable,fitsastraight

linethroughthesetofnpointsinsuchawaythatmakesthesumofsquaredresidualsofthemodel(that

is,verticaldistancesbetweenthepointsofthedatasetandthefittedline)assmallaspossible.

Theoutcomevariableisrelatedtoasinglepredictor.Theslopeofthefittedlineisequaltothecorrelation

betweenyandxcorrectedbytheratioofstandarddeviationsofthesevariables.Theinterceptofthefittedlineissuchthatitpassesthroughthecenterofmass(x,y)ofthedatapoints.

Appliedinmodelusedtopredicttimetakenbyacommutertoreachhisworklocation.

Page 42: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 7|52

AppendixV:Code

Codeusedtopredictmodeoftransport:Pittsburgh_Mode_Choice.R

(ApplicableforboththestepsofTask1)

library("nnet")

#Functionforimportcsvfile

import.csv<-function(filename){

return(read.csv(filename,sep=",",header=TRUE))

}

#Importdataset

pa<-import.csv('mergedPUMS2013.csv')

#ExtractonlyPittsburghCitydatabyPUMAcode

pit<-subset(pa,PUMA==01701|PUMA==01702)

#Createanemptydataframeforlateruse

final<-data.frame()

#RandomSampling

for(iin1:1000){

#Shuffle

pit<-pit[sample(nrow(pit)),]

#20%

#pit.20<-pit[1:189,]

#40%

#pit.40<-pit[1:377,]

#60%

#pit.60<-pit[1:566,]

#80%

#pit.80<-pit[1:754,]

#100%

Page 43: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 8|52

pit.100<-pit

#Combinetheresults

final<-rbind(final,results(pit.100))

}

#Replacemissingvalues

final[is.na(final)]<-0

#Calulatethemeanaccuracyasfinalresults

onethousandtimes.hundredpercent<-data.frame(Car=mean(final$Car),

Bus=mean(final$Bus),

StreetCar=mean(final$Streetcar),

Subway=mean(final$Subway),

Railroad=mean(final$Railroad),

Ferryboat=mean(final$Ferryboat),

Taxicab=mean(final$Taxicab),

Motorcycle=mean(final$Motorcycle),

Bicycle=mean(final$Bicycle),

Walked=mean(final$Walked),

Worked.at.home=mean(final$Worked.at.home),

Other.method=mean(final$Other.method)

)

#Functionforthemodel

results<-function(df){

#Variables

variables <-

c("JWMNP","FMRGIP","MIG","PINCP","COW","JWRIP","INTP","FLANXP","PWGTP","SPORDER","PERNP","HI

NS2","TYPE","FS","RACASN","RNTM","JWTR")

#Subsetthedatasetbyprovidedvariables

df<-df[variables]

#Dataprocessing

Page 44: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 9|52

df<-subset(df,JWTR!="")

df[is.na(df)]<-0

df$FMRGIP<-as.factor(df$FMRGIP)

df$MIG<-as.factor(df$MIG)

df$COW<-as.factor(df$COW)

df$JWRIP<-as.factor(df$JWRIP)

df$FLANXP<-as.factor(df$FLANXP)

df$HINS2<-as.factor(df$HINS2)

df$TYPE<-as.factor(df$TYPE)

df$FS<-as.factor(df$FS)

df$RACASN<-as.factor(df$RACASN)

df$RNTM<-as.factor(df$RNTM)

df$JWTR<-as.factor(df$JWTR)

private<-subset(df,JWTR=="1")

public<-subset(df,JWTR!="1")

#Functionfor10folderscrossvalidation

smoothCV<-function(x,y,K=10){

result<-data.frame()

x<-c(x)

y<-c(y)

df<-data.frame(x,y)

df<-df[sample(nrow(df)),]

folds=cut(1:nrow(df),breaks=K,labels=FALSE)

for(jin1:K){

train<-df[folds!=j,]

test<-df[folds==j,]

fit<-multinom(JWTR~.,data=train,MaxNWts=8268)

predicted2<-predict(fit,test,type="class")

Page 45: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 10|52

result<-rbind(result,data.frame(pred=predicted2,true=test$JWTR))

}

return(result)

}

#Runcrossvalidation

x<-df[,names(df)!="JWTR"]

y<-data.frame(df$JWTR)

colnames(y)<-c("JWTR")

result<-smoothCV(x,y,10)

#Createvariablesforlateruse

Car<-0

Bus<-0

Streetcar<-0

Subway<-0

Railroad<-0

Ferryboat<-0

Taxicab<-0

Motorcycle<-0

Bicycle<-0

Walked<-0

Worked.at.home<-0

Other.method<-0

#Accumulatetheoccurrencesofeachtransportationtool

for(iin1:nrow(result)){

if(result$true[i]=="1"&&result$pred[i]=="1"){

Car<-Car+1

}elseif(result$true[i]=="2"&&result$pred[i]=="2"){

Bus<-Bus+1

}elseif(result$true[i]=="3"&&result$pred[i]=="3"){

Page 46: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 11|52

Streetcar<-Streetcar+1

}elseif(result$true[i]=="4"&&result$pred[i]=="4"){

Subway<-Subway+1

}elseif(result$true[i]=="5"&&result$pred[i]=="5"){

Railroad<-Railroad+1

}elseif(result$true[i]=="6"&&result$pred[i]=="6"){

Ferryboat<-Ferryboat+1

}elseif(result$true[i]=="7"&&result$pred[i]=="7"){

Taxicab<-Taxicab+1

}elseif(result$true[i]=="8"&&result$pred[i]=="8"){

Motorcycle<-Motorcycle+1

}elseif(result$true[i]=="9"&&result$pred[i]=="9"){

Bicycle<-Bicycle+1

}elseif(result$true[i]=="10"&&result$pred[i]=="10"){

Walked<-Walked+1

}elseif(result$true[i]=="11"&&result$pred[i]=="11"){

Worked.at.home<-Worked.at.home+1

}elseif(result$true[i]=="12"&&result$pred[i]=="12"){

Other.method<-Other.method+1

}

}

#Calculatetheaccuracy

car<-subset(df,JWTR=="1")

car.accurary<-Car/nrow(car)

bus<-subset(df,JWTR=="2")

bus.accurary<-Bus/nrow(bus)

streetcar<-subset(df,JWTR=="3")

streetcar.accurary<-Streetcar/nrow(streetcar)

subway<-subset(df,JWTR=="4")

Page 47: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 12|52

subway.accurary<-Subway/nrow(subway)

railroad<-subset(df,JWTR=="5")

railroad.accurary<-Railroad/nrow(railroad)

ferryboat<-subset(df,JWTR=="6")

ferryboat.accurary<-Ferryboat/nrow(ferryboat)

taxicab<-subset(df,JWTR=="7")

taxicab.accurary<-Taxicab/nrow(taxicab)

motorcycle<-subset(df,JWTR=="8")

motorcycle.accurary<-Motorcycle/nrow(motorcycle)

bicycle<-subset(df,JWTR=="9")

bicycle.accurary<-Bicycle/nrow(bicycle)

walked<-subset(df,JWTR=="10")

walked.accurary<-Walked/nrow(walked)

worked.at.home<-subset(df,JWTR=="11")

worked.at.home.accurary<-Worked.at.home/nrow(worked.at.home)

other.method<-subset(df,JWTR=="12")

other.method.accurary<-Other.method/nrow(other.method)

accu<-data.frame(Car=car.accurary,

Bus=bus.accurary,

Streetcar=streetcar.accurary,

Subway=subway.accurary,

Railroad=railroad.accurary,

Ferryboat=ferryboat.accurary,

Taxicab=taxicab.accurary,

Motorcycle=motorcycle.accurary,

Bicycle=bicycle.accurary,

Walked=walked.accurary,

Worked.at.home=worked.at.home.accurary,

Other.method=other.method.accurary

Page 48: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 13|52

)

return(accu)

}

CodeusedtogettheFIPScodeofstopfromtheCensusAPI:getStopFIPS.py

importre

importrequests

frombs4importBeautifulSoup,Tag

importcsv

defmain():

a=1

map={}

withopen('stopIDFIPS','w',newline='')asfp:

a=csv.writer(fp,delimiter=',')

title=["stopID","FIPS"]

a.writerow(title)

withopen('PAAC_Stops_1603_Public2.csv','r')ascsvfile:

readCSV=csv.reader(csvfile,delimiter=',')

forrowinreadCSV:

ifa==1:

a=2

continue

all=row[4]+"+"+row[5]

value=map.get(all)

ifvalue==None:

url="http://data.fcc.gov/api/block/find?format=json&latitude="+row[4]+"&longitude="+row[5]+"

&showall=true"

html=requests.get(url).content

htmltxt=BeautifulSoup(html,'html.parser')

matchobj=re.findall(r'{"Block":{"FIPS":"(\d+)"},"County":',str(htmltxt))

iflen(matchobj)>0:

map[all]=matchobj[0]

value=matchobj[0]

ifvalue!=None:

data=[row[0],value]

a.writerow(data)

main()

Page 49: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 14|52

Codeusedtofiltermeantraveltimefromtractleveltocountylevel:filter.py

importre

importrequests

frombs4importBeautifulSoup,Tag

importcsv

defmain():

a=1

begin=set()

map={}

PUMAMap={}

withopen('filteredMeanTT.csv','w',newline='')asfp:

a=csv.writer(fp,delimiter=',')

title=["PUMAtract","meanTT"]

a.writerow(title)

withopen('NeighborhoodPuma.csv','r')ascsvfile:

readCSV=csv.reader(csvfile,delimiter=',')

forrowinreadCSV:

ifa==1:

a=2

continue

puma=row[0]

tract=row[2][0:11]

begin.add(tract)

PUMAMap[tract]=puma

withopen('FIPSMeanTT.csv','r')asinput:

readInput=csv.reader(input,delimiter=',')

forrowinreadInput:

try:

FIPS=row[0]

time=float(row[1])

FIPSstart=FIPS[0:11]

except:

continue

ifFIPSstartinbegin:

pumaValue=PUMAMap.get(FIPSstart)

value=map.get(pumaValue)

ifvalue==None:

map[pumaValue]=[time,1]

else:

map[pumaValue]=[value[0]+time,value[1]+1]

forkey,valueinmap.items():

meanTT=value[0]/value[1]

data=[key,meanTT]

a.writerow(data)

Page 50: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 15|52

main()

Codeusedtocalculatethemeantime:calculateMeanTT.py

importre

importrequests

importcsv

fromosimportlistdir

fromos.pathimportisfile,join

fromcollectionsimportdeque

defmain():

countmiss=0

countsucc=0

FIPSMap={}

trackMap={}

d=deque()

countMissOff=0

withopen('FIPSMeanTT.csv','w',newline='')asfp:

a=csv.writer(fp,delimiter=',')

title=["FIPS","meanTT"]

a.writerow(title)

withopen('stopIDFIPS.csv','r',newline='')ascsvfile1:

readCSV1=csv.reader(csvfile1,delimiter=',')

forrowinreadCSV1:

FIPSMap[row[0]]=row[1]

mypath='/Users/Yun/Documents/capstone/findTrackPublicMeanTT/StopInfo'

onlyfiles=[fforfinlistdir(mypath)ifisfile(join(mypath,f))]

forfileinonlyfiles:

iffile.endswith(".csv"):

withopen(mypath+"/"+file,'r')ascsvfile:

readCSV=csv.reader(csvfile,delimiter=',')

forrowinreadCSV:

iflen(row)>20:

ifrow[0]!='DOW':

try:

stopID=row[8].strip()

on=int(row[16])

off=int(row[17])

min=float(row[20])

exceptException:

continue

FIPS=FIPSMap.get(stopID)

Page 51: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 16|52

ifFIPS==None:

countmiss=countmiss+1

continue

else:

countsucc=countsucc+1

foriind:

trackInfo=trackMap.get(i)

iftrackInfo==None:

trackMap[i]=[min,0]

else:

iflen(trackInfo)==2:

trackMap[i]=[trackInfo[0]+min,trackInfo[1]]

trackInfoCur=trackMap.get(FIPS)

iftrackInfoCur==None:

trackMap[FIPS]=[0,0]

trackInfoCur=[0,0]

trackMap[FIPS]=[trackInfoCur[0],trackInfoCur[1]+on]

whileon>0:

d.append(FIPS)

on=on-1

whileoff>0:

iflen(d)==0:

countMissOff=countMissOff+off

break

else:

d.popleft()

off=off-1

whilecountMissOff>0:

iflen(d)>0:

d.popleft()

countMissOff=countMissOff-1

else:

break

forkey,valueintrackMap.items():

ifvalue[1]==0:

continue

else:

meanTT=value[0]/value[1]

print(key,value)

data=[key,meanTT]

a.writerow(data)

print(countmiss,countsucc)

main()

Page 52: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 17|52

Codeusedtopredictthetravel-time:travel_time.R

Lm (formula = X$Avg_TT ~X$Public_MTT + X$Parking_Transactions + X$Population_Total +

X$Means_of_Transport_Total,data=X)

Page 53: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 18|52

References

Ø NationalHouseholdTravelSurveyDataAnalysis:Ø U.S.CensusBureauprojectionsto2060methodology:

http://www.census.gov/population/projections/data/national/2014.htmlØ U.S.CensusBureauprojectionsto2060summarytables:

http://www.census.gov/population/projections/data/national/2014/summarytables.htmlØ SummaryreportabouttheU.S.CensusBureauprojectionsto2060:

http://www.census.gov/content/dam/Census/library/publications/2015/demo/p25-1143.pdfØ ACSLongitudinalEmployer-HouseholdDynamics:Job-to-JobFlows(J2J)Data(Beta)DataFiles:

http://lehd.ces.census.gov/data/j2j_beta.htmlØ ACSLongitudinalEmployer-HouseholdDynamics:Job-to-JobFlows(J2J)Data(Beta)Data

dictionary:http://lehd.ces.census.gov/data/schema/V4.1c-draft/lehd_public_use_schema.html

Ø MeasuringCommutingintheAmericanTimeUseSurvey:http://bae.uncg.edu/econ/files/2015/02/Kimbrough-working-paper-15-02.pdf

Ø OhioTIMES:transportationinformationmappingsystemhttp://gis.dot.state.oh.us/tims/Ø 2011OregonHouseholdActivitySurvey:

http://www.oregon.gov/ODOT/TD/TP/Pages/Data.aspxØ TheImpactsofSocio-EconomicandDemographicShiftsinTransitServedNeighborhoodsOn

ModeChoiceAndEquitybyStevenApellØ DemographicTrendsTheEffectsofDemographicChangeonSelectedTransportationServicesand

DemandbyMurdockSH,ClineME,ZeyM,PerezD,&JeantyPWØ ModeChoicesofMillennials:HowDifferent?HowEnduring?byRobertB.Case,PE,PhD,Seth

SchipinskiØ TravelTimeUseOverFiveDecadesbyChenSong&ChaoWeiØ NortheastTravelChoiceSurvey(NTCS)UniversityofVermont’sTransportationResearchCenter

(datanotpublic)Ø 2015UrbanMobilityScorecardpublishedjointlybytheTexasA&MTransportationInstituteand

INRIXØ INRIXInterfaceGuideDecember2014publishedbyINRIXØ ImpactofTrafficCongestiononBusTravelTimeinNorthernNewJerseybyClaireE.McKnight,

HerbertS.Levinson,KaanOzbay,CamilleKamga,andRobertE.PaaswellØ CommutinginAmericaReportIII-TheThirdNationalReportonCommutingPatternsandTrends

byAlanE.PisarskiØ RStudioTool:https://www.rstudio.com/products/rstudio/download/Ø WekaTool:https://weka.waikato.ac.nz/explorer Ø NationalHouseholdTravelSurveyData:Ø INRIXdata:Heinz College,Carnegie Mellon University Research CentreØ PittsburghDowntownParkingTerminals:Heinz College, Carnegie Mellon University

Research CentreØ Automaticpassengercounterandautomaticvehiclelocationdata:Heinz College, Carnegie

Mellon University Research Centre

Page 54: Capstone Project - U.S. Census Bureau, Center for Adaptive ...mac.heinz.cmu.edu/wp-content/uploads/2018/05/...CAD 11-30-15 Capstone Project - U.S. Census Bureau, Center for Adaptive

P a g e 19|52

Ø PittsburghGISinformation:http://pittsburghpa.gov/dcp/gis/gis-dataØ DescriptionofStatisticalModels:Wikipedia and www.statisticssolutions.comØ GoodnessofFitexplanationforstatisticalmodel:

http://www.medicine.mcgill.ca/epidemiology/joseph/courses/EPIB-621/fit