Big Data Airline Project at UAEU
Big Data Airlines Project
Ziyad Saleh
What is Big Data? Big data is a broad term for data sets so large or complex that they are difficult to process using traditional data processing applications.
Big data is typically measured in terabytes (1 TB = 1024 GB), and terabytes of new data are generated daily, so the speed of analyzing this huge flow of data is itself a challenge.
Big data can be described by the 4 Vs: Volume, Velocity, Variety and Veracity.
Small Data Vs. Big Data
Map Reduce
Map Reduce model
Project Scope
The scope is limited to:
1. Installing and configuring the Hadoop Map/Reduce platform.
2. Analyzing a big data sample of U.S. domestic flight performance and delays over 5 years. The analysis covers:
   a. Top carriers experiencing delays.
   b. Top airports and states with departure delays.
   c. Plotting state delays on a thematic map of the USA.
Source of Data for the Project
Datasets will be collected from: U.S. Department of Transportation's (DOT) - Statistical Computing
The dataset size will be between 500 MB and 1 TB, covering 5 years of flight statistics.
Size of Data
Field Name          Description
Year                Year of the scheduled flight.
Month               Month of the scheduled flight (1–12).
Day                 Day of the month (1–31).
DepTime             Actual departure time of the flight.
CRSDepTime          Scheduled departure time.
ArrTime             Actual arrival time in HH/MM format.
CRSArrTime          Scheduled arrival time.
FlightNum           Flight number.
ArrDelay            Arrival delay.
DepDelay            Departure delay, in minutes.
CarrierDelay        Delay (in minutes) caused by factors within control of the carrier.
WeatherDelay        Delay (in minutes) caused by extreme weather conditions.
NASDelay            Delay (in minutes) within the control of the National Airspace System (NAS).
SecurityDelay       Delay (in minutes) caused by security reasons.
LateAircraftDelay   Delay (in minutes) due to the same aircraft arriving late at a previous airport.
Table 1: Airline Dataset Dictionary.
Data Pre-Processing, Processing and Analytics
Data pre-processing: Data will be cleansed and artifacts will be filtered out as necessary. Many fields in the airline data set will be discarded because they are irrelevant to the subject of delay that we are concerned with.
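As a rough illustration of this cleansing step, the sketch below keeps only a few delay-related columns from one CSV record and drops records whose delay values are missing. It is a minimal sketch, not the project's actual code: the class name DelayFieldFilter, the column name UniqueCarrier, the sample header, and the use of "NA" for missing values are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelayFieldFilter {
    // Columns to keep. ArrDelay and DepDelay come from Table 1;
    // UniqueCarrier is an assumed column name for the carrier code.
    static final List<String> KEEP = Arrays.asList(
        "Year", "Month", "UniqueCarrier", "ArrDelay", "DepDelay");

    /**
     * Returns the kept columns of one CSV row, or null when the row is
     * malformed or a delay value is missing or non-numeric.
     */
    public static List<String> project(String[] header, String row) {
        String[] cells = row.split(",", -1);
        if (cells.length != header.length) {
            return null; // malformed record
        }
        List<String> headerList = Arrays.asList(header);
        List<String> out = new ArrayList<>();
        for (String name : KEEP) {
            int idx = headerList.indexOf(name);
            if (idx < 0) {
                return null; // expected column not present
            }
            String value = cells[idx].trim();
            // Delay columns must hold a whole number of minutes.
            if (name.endsWith("Delay") && !value.matches("-?\\d+")) {
                return null;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical header and rows, for illustration only.
        String[] header =
            "Year,Month,DayofMonth,UniqueCarrier,FlightNum,ArrDelay,DepDelay".split(",");
        System.out.println(project(header, "2008,1,3,WN,335,-14,8")); // [2008, 1, WN, -14, 8]
        System.out.println(project(header, "2008,1,3,WN,335,NA,8"));  // null
    }
}
```

Looking columns up by header name rather than by position keeps the sketch independent of the real file's column order, which the slides do not state.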
Data processing and analytics: Data will be processed with Java programs on Map/Reduce to reduce its size and produce smaller, organized datasets. The resulting datasets will then be analyzed using additional tools such as R.
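The map and reduce steps described here can be sketched in plain Java without a Hadoop cluster. The example below is an assumption-laden illustration, not the project's mapper and reducer: the class CarrierDelayMapReduce, the "carrier,arrDelay" record layout, and the choice to count early arrivals as zero delay are all hypothetical. A small local driver stands in for Hadoop's shuffle phase.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CarrierDelayMapReduce {
    /**
     * Map step: turns one cleaned record "carrier,arrDelay" into a
     * (carrier, delay minutes) pair. Early arrivals count as 0 (assumption).
     */
    public static Map.Entry<String, Long> map(String record) {
        String[] fields = record.split(",");
        long delay = Math.max(0L, Long.parseLong(fields[1].trim()));
        return new SimpleEntry<>(fields[0].trim(), delay);
    }

    /** Reduce step: sums all delay values emitted for one carrier. */
    public static long reduce(List<Long> delays) {
        long total = 0;
        for (long d : delays) {
            total += d;
        }
        return total;
    }

    /** Local stand-in for the shuffle: group pairs by key, then reduce each group. */
    public static Map<String, Long> run(List<String> records) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Long> kv = map(record);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample records, not project data.
        List<String> sample = Arrays.asList("WN,10", "AA,25", "WN,-5", "AA,5", "DL,0");
        System.out.println(run(sample)); // {AA=30, DL=0, WN=10}
    }
}
```

In a real Hadoop job the grouping would be done by the framework between the Mapper and Reducer classes; here it is done in memory so the example runs on its own.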
Data Storage
Data will be stored across multiple HDFS storage nodes, with a total size between 500 GB and 1 TB.
[Figure: Airlines Big Data in HDFS]
Target Analysis: Over the 5 years of flight information for all US domestic airlines:
1. Which carriers have the most aggregated delay in their flights?
2. Which states have the most delays?
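Both questions reduce to ranking aggregated totals. Assuming the totals per carrier (or per state) have already been computed, a top-N ranking could look like the sketch below. The class TopDelays, the method topN, and the sample figures are hypothetical and serve only to illustrate the ranking step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopDelays {
    /**
     * Returns the n keys with the largest totals, highest first.
     * Ties are broken alphabetically so the result is deterministic.
     */
    public static List<String> topN(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byDelay = Long.compare(b.getValue(), a.getValue());
            return byDelay != 0 ? byDelay : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            keys.add(entries.get(i).getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        // Made-up totals in minutes, not results from the project.
        Map<String, Long> totals = new LinkedHashMap<>();
        totals.put("WN", 120L);
        totals.put("AA", 300L);
        totals.put("DL", 45L);
        totals.put("UA", 300L);
        System.out.println(topN(totals, 3)); // [AA, UA, WN]
    }
}
```

The same method works for states if the map is keyed by state code instead of carrier code.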
Design
Airlines Project Workflow and Design
[Diagram labels: Master Node (Name Node, Job Tracker); Node 1, Node 2, Node 3, Node 4; Airlines Big Data; Task (Java Code); Reducer Node; HDFS; Mapper; Reducer; Top Airlines]
Implementation
Software and Tools
1. CentOS Linux Operating System.
2. Apache Hadoop
3. Cloudera CDH 5.3 virtual machine
4. Oracle VM VirtualBox Manager
5. Eclipse IDE
6. Java (Oracle JDK)
7. Maven
8. Microsoft Excel and Access 2010.
9. The R statistical tool
Mapper:
Reducer:
R:
Findings
US Airlines Delay (Per Carrier)
[Chart: per-carrier series ArrivalOnTime, ArrivalDelays, DepartureOnTime, DepartureDelays, Cancellations, Diversions; carriers WN, AA, OO, MQ, US, DL, UA, XE, NW, CO, EV, 9E, FL, YV, OH, B6, AS, F9, HA, AQ, PI, HP, EA, PS, TW; vertical axis from 0 to 1.2]
Thematic Map of US Airlines Delay (Per State)
Conclusion
Big data is the large amount of continuously generated data that cannot be processed and analyzed using traditional data management tools.
Big data is a new, rapidly growing field that is reshaping the future; demand for big data scientists is already high and is expected to continue in the coming years.
Hadoop is an open source framework for storing and processing large datasets using clusters of commodity hardware.
Big data analytics is attracting both businesses and policy makers, who leverage it to make more informed decisions and plan for the future.
Big Data now, normal data tomorrow.
Big Data Tutorials
Online Big Data Tutorials:
1. Udemy : https://www.udemy.com/course/subscribe/?courseId=336982&dtcode=lGCe31035ujY
2. Udacity : https://www.udacity.com/courses#!/data-science
3. EMC : https://education.emc.com/guest/campaign/data_science.aspx
4. Coursera : https://www.coursera.org/course/datasci
5. Caltech, Learning from Data : http://work.caltech.edu/telecourse.html
6. MIT OpenCourseWare : http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/index.htm
7. Stanford OpenClassroom : http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning
8. Big Data University : https://bigdatauniversity.com/curriculum-map/
Thank You
Ziyad Saleh
O Allah, teach us what benefits us, benefit us with what You have taught us, and increase us in knowledge.