Informed Traveler

Post on 27-Jun-2015

Category:

Engineering

DESCRIPTION

Insight Data Engineering Project

TRANSCRIPT

Seda Davtyan

Informed Traveler

Problem

Datasets

• RITA (Research and Innovative Technology Administration) On-Time Performance dataset (October 1987 through July 2014)
  o Updated quarterly
• After unzipping all the files (one file per year and month), the data is about 65 GB

UI Demo

• Informed Traveler

Some Insight

Airline             Period             On Time   Carrier
Jet Blue            01/2012-01/2013    67.39%    12.33%
Jet Blue            01/2013-01/2014    61.36%    14.81%
Southwest           01/2012-01/2013    80.76%     6.41%
Southwest           01/2013-01/2014    72.17%     9.56%
American Airlines   01/2012-01/2013    75.48%     8.00%
American Airlines   01/2013-01/2014    68.70%     9.74%
Delta               01/2012-01/2013    80.67%     5.61%
Delta               01/2013-01/2014    81.81%     5.76%

Security + Weather Related Delays < 2%

Screenshots

[Screenshots, 01/2013 – 01/2014: Jet Blue, Southwest, AA, Delta]

Data Pipeline

[Pipeline diagram; recoverable labels: RITA, Data Collection, Flask API]

Tradeoffs

Calculate the number of delays of each type (Delay > 15):

• Carrier
• Weather
• NAS
• Security
• Late Aircraft
• Unclassified (New Field)
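The counting step on this slide can be sketched in Python as follows. This is only an illustration, not the project's Pig script. It assumes that the threshold of 15 is in minutes and applies to the arrival delay, that the column names are `ArrDelay`, `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay` and `LateAircraftDelay`, and that "Unclassified" marks a delayed flight with no recorded cause. The function names are hypothetical.

```python
import math
from collections import Counter

# Assumed column names; check them against the actual file header.
CAUSE_COLUMNS = {
    "Carrier": "CarrierDelay",
    "Weather": "WeatherDelay",
    "NAS": "NASDelay",
    "Security": "SecurityDelay",
    "Late Aircraft": "LateAircraftDelay",
}
DELAY_THRESHOLD_MINUTES = 15  # assumption: the slide's "Delay > 15" is in minutes


def to_minutes(value):
    """Convert a raw CSV field to minutes; blank, missing or invalid fields count as 0."""
    try:
        minutes = float(value)
    except (TypeError, ValueError):
        return 0.0
    return 0.0 if math.isnan(minutes) else minutes


def classify_delay(flight):
    """Return the delay types for one flight record (a dict of column -> raw value).

    An empty list means the flight is not counted as delayed.
    """
    if to_minutes(flight.get("ArrDelay")) <= DELAY_THRESHOLD_MINUTES:
        return []
    causes = [
        name
        for name, column in CAUSE_COLUMNS.items()
        if to_minutes(flight.get(column)) > 0
    ]
    # Delayed, but no cause recorded: the new "Unclassified" field.
    return causes or ["Unclassified"]


def count_delays(flights):
    """Count flights per delay type; a flight with several causes counts once per cause."""
    counts = Counter()
    for flight in flights:
        causes = classify_delay(flight)
        if not causes:
            counts["On Time"] += 1
        for cause in causes:
            counts[cause] += 1
    return counts
```

Dividing each count by the total number of flights for an airline and period would give percentages of the kind shown under Some Insight, although the exact definition used for those figures is not stated in the slides.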

Tradeoffs

• Pig to clean up and transform the data
• UDF from piggybank to handle commas in a .csv file (CSVExcelStorage)
• HBaseStorage to transfer the data into HBase
• Construct composite row keys for fast HBase querying
  o Airline_Year-Month-Day
  o Airline_Year-Month
  o Airline_Year
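To illustrate two of the points above, here is a small Python sketch that runs without Pig or HBase. It shows why a CSV-aware parser is needed when quoted fields contain commas, and how composite keys in the three listed formats support prefix lookups over lexicographically sorted keys, which is how HBase orders rows. The helper names, the sample line, and the zero-padding of month and day are assumptions made for this example; the slides do not specify them.

```python
import csv
from bisect import bisect_left


def parse_csv_line(line):
    """Parse one CSV line, keeping commas that sit inside quoted fields."""
    return next(csv.reader([line]))


def row_keys(airline, year, month, day):
    """Build the three composite keys from the slide, finest to coarsest.

    Zero-padding is an assumption; it keeps string order equal to date order.
    """
    return [
        f"{airline}_{year:04d}-{month:02d}-{day:02d}",
        f"{airline}_{year:04d}-{month:02d}",
        f"{airline}_{year:04d}",
    ]


def prefix_scan(sorted_keys, prefix):
    """Return every key that starts with prefix from a sorted list of keys.

    This mimics a prefix scan: jump to the first candidate, then read
    forward until the prefix no longer matches.
    """
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


if __name__ == "__main__":
    # Hypothetical line layout: year, month, day, carrier, origin city, destination city.
    line = '2013,1,15,B6,"New York, NY","Boston, MA"'
    print(line.split(","))       # naive split breaks the quoted city names apart
    print(parse_csv_line(line))  # csv-aware parsing keeps them whole

    stored = sorted(
        ["B6_2013-01-15", "B6_2013-01-16", "B6_2013-02-01", "DL_2013-01-15"]
    )
    print(prefix_scan(stored, "B6_2013-01"))
```

Because the airline comes first in the key, all rows for one airline are stored next to each other, so a query for one airline over a day, month, or year reads a contiguous range rather than the whole table.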

Seda Davtyan

• PhD in CS&E, 2014
• Love Coffee and Camping
• Greatly Enjoy Baking
