Big Data Airline Project at UAEU

Big Data Airlines Project ZIYAD SALEH

Upload: ziyad-saleh

Post on 09-Aug-2015

Category:

Data & Analytics


TRANSCRIPT

Page 1: Big Data Airline Project at UAEU

Big Data Airlines Project

Ziyad Saleh

Page 2: Big Data Airline Project at UAEU

What is Big Data?

Big data is a broad term for data sets so large or complex that they are difficult to process using traditional data processing applications.

Big data means terabytes (1 TB = 1,024 GB) of data to be processed and analyzed. Terabytes of new data are generated daily, so the speed of analyzing this huge flow of data is a challenge.

Big data can be described by the 4 Vs: Volume, Velocity, Variety, and Veracity.

Page 3: Big Data Airline Project at UAEU
Page 4: Big Data Airline Project at UAEU
Page 5: Big Data Airline Project at UAEU
Page 6: Big Data Airline Project at UAEU

Small Data Vs. Big Data

Page 7: Big Data Airline Project at UAEU
Page 9: Big Data Airline Project at UAEU

Map Reduce

Page 10: Big Data Airline Project at UAEU

Map Reduce model
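
The three phases of the Map Reduce model (map, shuffle, reduce) can be illustrated with a small in-memory sketch. This is plain Java and deliberately does not use the Hadoop API; the class name, the "carrier,delayMinutes" input format, and the sample values are assumptions made only for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map, shuffle, and reduce phases (not the Hadoop API). */
final class MapReduceModelSketch {

    /** Map phase: turn one input line "carrier,delayMinutes" into a (key, value) pair. */
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(",");
        return new SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    /** Shuffle phase: group all values that share the same key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: collapse the values of one key into a single result (here, a sum). */
    static int reduce(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Runs the three phases in sequence over a small in-memory input. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        System.out.println(run(Arrays.asList("WN,10", "AA,5", "WN,20"))); // prints {AA=5, WN=30}
    }
}
```

On a real cluster the framework performs the shuffle and distributes the map and reduce calls across nodes; the sketch only shows how data moves between the phases.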

Page 11: Big Data Airline Project at UAEU
Page 12: Big Data Airline Project at UAEU

Project Scope

Page 13: Big Data Airline Project at UAEU

The scope is limited to:

1. Installing and configuring the Hadoop Map/Reduce platform.

2. Analyzing a big data sample of U.S. domestic flight performance and delays over 5 years to identify:

   1. The top carriers experiencing delays.
   2. The top airports and states with departure delays.

3. Plotting state delays on a thematic map of the USA.

Page 14: Big Data Airline Project at UAEU

Source of Data for the Project

Datasets will be collected from: U.S. Department of Transportation (DOT) – Statistical Computing

Page 15: Big Data Airline Project at UAEU

Size of Data

The dataset size will be between 500 MB and 1 TB, covering 5 years of flight statistics.

Page 16: Big Data Airline Project at UAEU

Field Name          Description
Year                Year of the scheduled flight.
Month               Month of the scheduled flight (1–12).
Day                 Day of the month (1–31).
DepTime             Actual departure time of the flight.
CRSDepTime          Scheduled departure time.
ArrTime             Actual arrival time, in HH/MM format.
CRSArrTime          Scheduled arrival time.
FlightNum           Flight number.
ArrDelay            Arrival delay.
DepDelay            Departure delay, in minutes.
CarrierDelay        Delay (in minutes) caused by factors within the control of the carrier.
WeatherDelay        Delay (in minutes) caused by extreme weather conditions.
NASDelay            Delay (in minutes) within the control of the National Airspace System (NAS).
SecurityDelay       Delay (in minutes) caused by security reasons.
LateAircraftDelay   Delay (in minutes) due to the same aircraft arriving late at a previous airport.

Table 1: Airline Dataset Dictionary.
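
As a rough sketch of how the fields in Table 1 might be read from a comma-separated record, the class below looks up a delay column by name using the file's header line, so it makes no assumption about column order. The class name is hypothetical, and treating non-numeric cells such as "NA" as missing values is an assumption about the data files rather than something stated in the slides.

```java
import java.util.HashMap;
import java.util.Map;

/** Reads named numeric fields from CSV flight records, using the header line for positions. */
final class FlightFieldReader {
    private final Map<String, Integer> columnIndex = new HashMap<>();

    /** Remembers the position of every column named in the header line. */
    FlightFieldReader(String headerLine) {
        String[] names = headerLine.split(",");
        for (int i = 0; i < names.length; i++) {
            columnIndex.put(names[i].trim(), i);
        }
    }

    /** Returns the named field as minutes, or null when it is absent or non-numeric (e.g. "NA"). */
    Integer minutes(String csvLine, String field) {
        Integer idx = columnIndex.get(field);
        String[] cells = csvLine.split(",", -1); // -1 keeps trailing empty cells
        if (idx == null || idx >= cells.length) {
            return null;
        }
        try {
            return Integer.valueOf(cells[idx].trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Made-up header subset and record, for illustration only.
        FlightFieldReader reader = new FlightFieldReader("Year,Month,DepDelay,ArrDelay");
        System.out.println(reader.minutes("2008,1,8,NA", "DepDelay")); // prints 8
        System.out.println(reader.minutes("2008,1,8,NA", "ArrDelay")); // prints null
    }
}
```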

Page 17: Big Data Airline Project at UAEU

Data Pre-Processing, Processing, and Analytics

Data pre-processing: Data will be cleansed, and artifacts will be filtered out as necessary. Many fields in the airline dataset will be discarded because they are irrelevant to the subject of delay that we are concerned with.
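
The pre-processing step might look roughly like the sketch below, which keeps only the columns relevant to delay and drops rows whose kept cells are missing. The class name, the "UniqueCarrier" and "TailNum" column names in the example, and the "NA" missing-value marker are assumptions; the slides do not show the actual cleansing code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Keeps selected CSV columns and drops incomplete rows (illustrative, in-memory only). */
final class PreprocessSketch {

    /**
     * Keeps only the columns named in keep (in that order) and drops rows whose
     * kept cells are empty or "NA". The first input line must be the header.
     */
    static List<String> clean(List<String> lines, List<String> keep) {
        List<String> header = new ArrayList<>();
        for (String name : lines.get(0).split(",")) {
            header.add(name.trim());
        }

        // Resolve each kept column name to its position in the header.
        int[] idx = new int[keep.size()];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = header.indexOf(keep.get(i));
            if (idx[i] < 0) {
                throw new IllegalArgumentException("Unknown column: " + keep.get(i));
            }
        }

        List<String> out = new ArrayList<>();
        out.add(String.join(",", keep));
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(",", -1);
            List<String> kept = new ArrayList<>();
            boolean complete = true;
            for (int i : idx) {
                String cell = i < cells.length ? cells[i].trim() : "";
                if (cell.isEmpty() || cell.equals("NA")) {
                    complete = false; // a kept field is missing, so skip the row
                    break;
                }
                kept.add(cell);
            }
            if (complete) {
                out.add(String.join(",", kept));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up records, for illustration only.
        List<String> raw = Arrays.asList(
                "Year,UniqueCarrier,TailNum,DepDelay",
                "2008,WN,N712SW,8",
                "2008,AA,N123AA,NA");
        System.out.println(clean(raw, Arrays.asList("UniqueCarrier", "DepDelay")));
        // prints [UniqueCarrier,DepDelay, WN,8]
    }
}
```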

Data processing and analytics: Data will be processed with Java programs on Map/Reduce to reduce its size and produce smaller, organized datasets. The resulting datasets will then be analyzed using additional tools such as R.

Page 18: Big Data Airline Project at UAEU

Data Storage

Data will be stored across multiple HDFS storage nodes, with a total size between 500 GB and 1 TB.

[Diagram labels: Airlines Big Data; HDFS]

Page 19: Big Data Airline Project at UAEU

Target Analysis

Over the 5 years of flight information for all U.S. domestic airlines:

1. Which carriers have the most aggregated delay in their flights?

2. Which states have the most delays?
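
Once total delay has been aggregated per carrier or per state, answering both questions reduces to a top-N ranking. The following is a minimal sketch of that ranking step; the class and method names and the sample totals are hypothetical, and ties are broken alphabetically only to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Ranks keys (carrier codes or state names) by their aggregated delay. */
final class TopDelaySketch {

    /** Returns up to n keys with the largest total delay, largest first. */
    static List<String> topKeys(Map<String, Long> totalDelay, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totalDelay.entrySet());
        entries.sort((a, b) -> {
            int byDelay = Long.compare(b.getValue(), a.getValue()); // larger total first
            return byDelay != 0 ? byDelay : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up totals in minutes, for illustration only.
        Map<String, Long> totals = new HashMap<>();
        totals.put("WN", 300L);
        totals.put("AA", 500L);
        totals.put("DL", 100L);
        System.out.println(topKeys(totals, 2)); // prints [AA, WN]
    }
}
```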

Page 20: Big Data Airline Project at UAEU

Design

Page 21: Big Data Airline Project at UAEU

Airlines Project Workflow and Design

[Workflow diagram. Labels: Master Node; Node 1; Node 2; Node 3; Node 4; Name Node; Job Tracker; Airlines Big Data; Task; Java Code; Reducer Node; HDFS; Mapper; Reducer; Top Airlines]

Page 22: Big Data Airline Project at UAEU

Implementation

Page 23: Big Data Airline Project at UAEU

Software and Tools

1. CentOS Linux Operating System

2. Apache Hadoop

3. Cloudera CDH 5.3 virtual machine

4. Oracle VM VirtualBox Manager

5. Eclipse IDE

6. Java (Oracle JDK)

7. Maven

8. Microsoft Excel and Access 2010

9. The R statistical tool

Page 24: Big Data Airline Project at UAEU

Mapper:

Page 25: Big Data Airline Project at UAEU

Reducer:

Page 26: Big Data Airline Project at UAEU

R:

Page 27: Big Data Airline Project at UAEU

Findings

Page 28: Big Data Airline Project at UAEU

US Airlines Delay (Per Carrier)

[Chart. Carriers: WN, AA, OO, MQ, US, DL, UA, XE, NW, CO, EV, 9E, FL, YV, OH, B6, AS, F9, HA, AQ, PI, HP, EA, PS, TW. Series: Arrival On Time, Arrival Delays, Departure On Time, Departure Delays, Cancellations, Diversions. Value axis: 0 to 1.2]

Page 29: Big Data Airline Project at UAEU

Thematic Map of US Airlines Delay (Per State)

Page 30: Big Data Airline Project at UAEU

Conclusion

Page 31: Big Data Airline Project at UAEU

Conclusion:

Big data is the large amount of continuously generated data that cannot be processed and analyzed using traditional data management tools.

Big data is a rapidly rising topic that is reshaping the future. Demand for big data scientists is already large and will continue to grow in the coming period.

Hadoop is an open source framework for storing and processing large datasets using clusters of commodity hardware.

Big data analytics is attracting both businesses and policy makers, who want to leverage it for more informed decisions and for planning the future.

Big Data now, normal data tomorrow.

Page 32: Big Data Airline Project at UAEU

Big Data Tutorials

Page 33: Big Data Airline Project at UAEU

Online Big Data Tutorials:

1. Udemy: https://www.udemy.com/course/subscribe/?courseId=336982&dtcode=lGCe31035ujY

2. Udacity: https://www.udacity.com/courses#!/data-science

3. EMC: https://education.emc.com/guest/campaign/data_science.aspx

4. Coursera: https://www.coursera.org/course/datasci

5. Caltech, Learning from Data: http://work.caltech.edu/telecourse.html

6. MIT OpenCourseWare: http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/index.htm

7. Stanford OpenClassroom: http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning

8. Big Data University: https://bigdatauniversity.com/curriculum-map/

Page 34: Big Data Airline Project at UAEU

Thank You

Ziyad Saleh

O Allah, teach us what benefits us, benefit us through what You have taught us, and increase us in knowledge.