data mining & analytics for u.s. airlines on-time performance

Data Analysis of U.S. Airlines On-time Performance

Yanxiang Zhu, Nilesh Padwal, Mingxuan Li

Finished by June 27th, 2014

Contents

1 Introduction
  1.1 Background and Problem Description
  1.2 Dataset Description

2 Collecting data

3 Preprocessing Data

4 Variables Description

5 Association Rule

6 Cluster Analysis
  6.1 Setup
  6.2 Determine K
  6.3 Cluster Analysis
    6.3.1 Pam
    6.3.2 Kmeans

7 Decision Tree
  7.1 Categorize Variable
  7.2 Rpart
  7.3 Ctree

8 Random Forest

9 Classification
  9.1 knn
  9.2 Processing Data
  9.3 SVM

INFO7374 Data Science Final Project

10 Processing Data

11 Conclusion

12 Limitation

13 Future Work


1 Introduction

1.1 Background and Problem Description

In the airline industry, carriers commonly struggle to get planes to the gate on time, and improving on-time performance is a current challenge. Beyond carriers’ services and baggage policies, saving passengers’ time in the air matters even more. Our goal is therefore to find valuable relationships in the dataset using data mining techniques such as cluster analysis, association rules, and decision trees.

1.2 Dataset Description

The dataset is a collection of airline data from the Research and Innovative Technology Administration (RITA). It contains detailed information on each flight from 1987 to 2008, with 29 variables such as destination, origin, arrival time, and departure time. These are important statistical records: any flight can be tracked through its specific features. Because of the limited performance of our computers, we can fetch only a part of the whole data (all U.S. airlines over 22 years) to process and analyze. Our selected dataset still has millions of observations, which is enough to obtain satisfying results. Here is a descriptive list of the useful variables.

1. DayofMonth December 1st to December 31st.

2. DayOfWeek 1 refers to Monday and in a similar way, 7 refers to Sunday.

3. DepTime Actual departure time

4. ArrTime Actual arrival time

5. CRSDepTime Scheduled departure time

6. CRSArrTime Scheduled arrival time

7. UniqueCarrier Unique carrier code

8. FlightNum Flight number

9. ActualElapsedTime In minutes

10. CRSElapsedTime In minutes

11. AirTime In minutes

12. ArrDelay Arrival delay, in minutes

13. DepDelay Departure delay, in minutes

14. Origin Origin IATA airport code

15. Dest Destination IATA airport code

16. Distance In miles

According to the historical record of on-time flight operations of U.S. air carriers, 2008 appears to be an interesting and special period for the airline industry: the on-time percentage was 76.0%, and it went up to 79.5% in 2009. That is why we choose this turning point to look for what lies behind the common numbers and words.

2 Collecting data

The dataset we use contains all commercial flights within the USA in 2008. The dataset is downloaded from http://stat-computing.org/dataexpo/2009. The dataset contains nearly 10 million records and takes 700MB of space.

file.name <- paste(2008, "csv.bz2", sep = ".")

# Download the compressed file only if it is not already present
if (!file.exists(file.name)) {
  url.text <- paste("http://stat-computing.org/dataexpo/2009/", 2008, ".csv.bz2", sep = "")
  cat("Downloading missing data file ", file.name, "\n", sep = "")
  download.file(url.text, file.name)
}

To import the data into our workspace, we use the read.csv function. We store the dataset in d.

d <- read.csv(file.name)  # read.csv can read the bz2-compressed file directly

3 Preprocessing Data

Since the analysis needs a well-structured dataset, we omit the NA values. Because of the limited processing capability of our computers, we decide to work only with data from December 2008. It still has 1,524,735 observations of 29 variables, and we think such a large-scale dataset is enough to obtain good analysis results.

d = subset(d, Month == "12")

d = na.omit(d)
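The two preprocessing calls can be tried on a small made-up data frame first. This is a minimal sketch, not the project's data: the values are invented and `toy`, `dec` and `clean` are hypothetical names. It shows that `na.omit` drops a row as soon as any column is NA, including columns that are discarded later, so it may be worth counting rows before and after the call.

```r
# Toy data frame: column names mirror the flight data, values are made up.
toy <- data.frame(
  Month        = c(11, 12, 12, 12),
  ArrDelay     = c(5, 41, NA, 25),
  CarrierDelay = c(NA, 0, 3, NA)
)

dec   <- subset(toy, Month == 12)  # keep December rows only
clean <- na.omit(dec)              # drop every row with an NA in any column

nrow(dec)    # 3 rows before na.omit
nrow(clean)  # 1 row after na.omit
```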

After that, we also need to remove some of the columns that we think are not useful in our study. So we remove them directly from the original dataset.

d = d[, -20:-29]

In addition, since we have already decided to use only the data from December 2008, the Year and Month columns become useless.

d = d[, -1]

d = d[, -1]

So far, our dataset contains 168647 records with 17 variables.

str(d)

## 'data.frame': 168647 obs. of 17 variables:

## $ DayofMonth : int 3 3 3 3 3 3 3 3 3 3 ...

## $ DayOfWeek : int 3 3 3 3 3 3 3 3 3 3 ...

## $ DepTime : int 1126 1859 1256 1925 2002 1716 1620 1807 1930 1004 ...

## $ CRSDepTime : int 1045 1825 1240 1900 1940 1610 1555 1725 1905 1005 ...

## $ ArrTime : int 1241 1925 1458 2120 2249 2054 1826 1910 2041 1130 ...

## $ CRSArrTime : int 1200 1900 1435 2100 2230 1950 1800 1845 2020 1115 ...

## $ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 18 18 18 18 18 18 18 18 18 18 ...

## $ FlightNum : int 2717 1712 294 2776 623 586 1259 548 619 1152 ...

## $ TailNum : Factor w/ 5374 levels "","80009E","80019E",..: 3796 2127 3943 3316 1395 3593 3475 1430 2524 1177 ...

## $ ActualElapsedTime: int 75 86 62 55 107 158 186 63 71 86 ...

## $ CRSElapsedTime : int 75 95 55 60 110 160 185 80 75 70 ...

## $ AirTime : int 55 73 45 46 93 140 177 50 56 51 ...

## $ ArrDelay : int 41 25 23 20 19 64 26 25 21 15 ...

## $ DepDelay : int 41 34 16 25 22 66 25 42 25 -1 ...

## $ Origin : Factor w/ 303 levels "ABE","ABI","ABQ",..: 3 3 3 3 3 3 3 3 3 3 ...

## $ Dest : Factor w/ 304 levels "ABE","ABI","ABQ",..: 82 157 160 175 177 181 219 223 223 291 ...

## $ Distance : int 349 487 289 332 718 1121 1111 328 328 321 ...

4 Variables Description

After importing the dataset, we explore the variables associated with each observation further. The variables are listed and described below.

1. DayofMonth December 1st to December 31st.

summary(d$DayofMonth)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1.0 11.0 18.0 17.2 23.0 31.0

2. DayOfWeek 1 refers to Monday and in a similar way, 7 refers to Sunday.

summary(d$DayOfWeek)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1.00 2.00 4.00 3.74 5.00 7.00

3. DepTime Actual departure time

summary(d$DepTime)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1 1120 1510 1470 1840 2400

Departure time is another key factor that we are going to examine. We want to know which time of day is best for a flight.

4. ArrTime Actual arrival time

summary(d$ArrTime)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1 1230 1640 1560 2010 2400

5. CRSDepTime Scheduled departure time

summary(d$CRSDepTime)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 5 1040 1420 1400 1750 2360

6. CRSArrTime Scheduled arrival time

summary(d$CRSArrTime)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1 1230 1620 1580 1950 2360

7. UniqueCarrier Unique carrier code

library(ggplot2)  # provides qplot

carrier = data.frame(d$UniqueCarrier)

qplot(x = d$UniqueCarrier, data = carrier, fill = d$UniqueCarrier)

[Figure: bar chart of flight count (y axis, 0 to 30000) by UniqueCarrier (x axis and fill legend): 9E, AA, AS, B6, CO, DL, EV, F9, FL, HA, MQ, NW, OH, OO, UA, US, WN, XE, YV.]

Southwest Airlines operated the most flights in the U.S. in 2008. Its number of flights is even greater than the sum of those of SkyWest Airlines and American Airlines. We will also help you find out which airline to choose if you want to avoid delays.

8. FlightNum Flight number

summary(d$FlightNum)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 1 658 1680 2360 3590 9740

9. ActualElapsedTime In minutes

summary(d$ActualElapsedTime)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 18 88 126 144 177 790

10. CRSElapsedTime In minutes

summary(d$CRSElapsedTime)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 26 82 116 135 165 660

11. AirTime In minutes

summary(d$AirTime)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 6 60 93 112 141 647

12. ArrDelay Arrival delay, in minutes

summary(d$ArrDelay)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 15.0 24.0 41.0 62.6 77.0 1660.0

The arrival delay is our target variable. The median is 41 minutes, which means the delay problem is severe. We are going to find out which factors cause the delay.

13. DepDelay Departure delay, in minutes

summary(d$DepDelay)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## -34.0 15.0 35.0 53.5 71.0 1600.0

14. Origin Origin IATA airport code

summary(d$Origin)

## ATL ORD DEN DFW DTW PHX EWR IAH LAS

## 12232 11020 7004 6208 4984 4353 4333 4269 4020

## MSP LAX JFK SLC SFO SEA CLT BOS PHL

## 4004 3837 3409 3375 3370 2992 2977 2841 2676

## MDW MCO CVG BWI LGA SAN DCA IAD MEM

## 2546 2481 2457 2383 2272 1832 1768 1714 1665

## MIA FLL STL MKE TPA MCI BNA CLE PDX

## 1601 1590 1504 1491 1489 1381 1358 1352 1329

## HOU RDU DAL OAK HNL SMF PIT SJC IND

## 1324 1297 1218 1211 1164 1131 1061 1003 989

## SNA ABQ AUS SAT MSY CMH PBI BUF OMA

## 944 905 872 809 806 757 733 727 704

## JAX BDL BUR RSW BHM ONT PVD GRR SDF

## 643 629 627 607 546 521 517 514 504

## TUL OKC RNO DSM SJU RIC MHT DAY MSN

## 493 488 485 478 471 452 426 419 408

## GEG LIT BOI ELP TUS ANC ICT LGB TYS

## 404 404 394 394 389 384 371 367 358

## ALB ROC XNA SYR OGG ORF HPN COS CID

## 356 356 344 343 332 322 317 311 292

## CHS FAT LEX GSO MLI CAE HSV SAV JAN

## 287 285 282 278 271 268 260 259 251

## (Other)

## 12768

The busiest airport in the U.S. is Atlanta (ATL). Chicago (ORD) and Denver (DEN) rank second and third.

15. Dest Destination IATA airport code

summary(d$Dest)

## ATL ORD DEN DFW LAX PHX LAS EWR SFO

## 11791 9506 6338 5159 5013 4663 4357 4335 4277

## IAH DTW MSP JFK SLC SEA MCO LGA PHL

## 4271 3575 3280 3238 3216 3131 3015 2818 2669

## BOS CLT SAN BWI FLL MDW CVG TPA MEM

## 2541 2422 2265 2142 1959 1912 1831 1748 1721

## DCA MIA IAD PDX RDU MCI STL SMF OAK

## 1710 1702 1494 1483 1455 1384 1374 1361 1351

## BNA CLE MKE SJC HOU SNA SAT DAL AUS

## 1346 1314 1290 1210 1196 1120 1096 1085 1049

## ABQ HNL PIT PBI MSY IND CMH RSW OMA

## 1014 981 935 917 889 841 831 758 757

## JAX BUR ONT BUF TUL OKC SJU BHM TUS

## 737 717 679 664 615 608 604 585 585

## BDL RNO ANC SDF PVD DSM GRR RIC ELP

## 578 567 560 526 505 501 498 485 480

## BOI LIT GEG MSN TYS ICT DAY XNA LGB

## 472 442 424 415 412 408 406 386 381

## MHT COS ORF GSO CHS ROC CAE JAN HPN

## 373 366 362 336 334 327 301 296 295

## SAV CID OGG FAT ALB SYR LEX HSV MLI

## 293 292 292 291 285 280 276 261 255

## (Other)

## 13756

The result is very similar to Origin. We also need to check whether the busiest airports suffer the most delay.

16. Distance In miles

summary(d$Distance)


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 31 338 599 753 984 4960

The majority of flights have a distance of under 1000 miles. The relationship between distance and delay time is another important question we need to examine.

5 Association Rule

After thinking about flight performance, we believe there are special and important factors that, once addressed, would help improve on-time performance. In this section, we look for hidden relationships among the different facets of the flight dataset. We raise some specific questions, for instance: how do factors such as distance and DayOfWeek influence on-time performance? Our goal is to answer these questions using association rules.

• Support The probability that antecedent and conclusion hold simultaneously in the data set.

• Confidence The conditional probability that the conclusion holds if the antecedent is satisfied.

• Lift The ratio of confidence to expected confidence. Values greater than 1 indicate that the rule has predictive potential.
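The three measures can be computed by hand on a tiny example. The sketch below uses base R only and a made-up table of eight flights (the `flights` data frame and its two columns are hypothetical, not part of the project data) for the single rule {morning} => {on_time}.

```r
# Hypothetical toy transactions: one row per flight, two logical items.
flights <- data.frame(
  morning = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE),
  on_time = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Rule under study: {morning} => {on_time}
support    <- mean(flights$morning & flights$on_time)  # P(antecedent and conclusion)
confidence <- support / mean(flights$morning)          # P(conclusion | antecedent)
lift       <- confidence / mean(flights$on_time)       # confidence / expected confidence

c(support = support, confidence = confidence, lift = lift)
# support = 0.5, confidence = 0.8, lift = 1.28
```

Here the lift of 1.28 means that, in this toy table, morning flights are on time more often than flights overall.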

First, we load the libraries that association rule mining requires.

library(arules)

library(arulesViz)

We copy the original dataset because we will do some further data transformation work on it. We store it as data.

data = d

According to our observation, it is effective to divide the numeric variables into ordered, reasonable ranges for our analysis. So we split Distance, ArrDelay, AirTime, CRSDepTime, and CRSArrTime respectively and make sure that all observations fall within the given ranges.

data$Distance = ordered(cut(data$Distance, c(0, 300, 600, 1000, Inf)),
    labels = c("Short", "Medium", "Long", "Too long"))

data$ArrDelay = ordered(cut(data$ArrDelay, c(0, 25, 50, 80, Inf)),
    labels = c("On-Time", "Delayed", "Intermediate-Delayed", "Much-Delayed"))

data$AirTime = ordered(cut(data$AirTime, c(-1, 50, 100, 200, 300, Inf)),
    labels = c("Too-Short", "Short", "Intermediate", "Long", "Too-Long"))

data$CRSDepTime = ordered(cut(data$CRSDepTime, c(-1, 600, 1200, 1800, Inf)),
    labels = c("Overnight", "Morning", "Afternoon", "Evening"))

data$CRSArrTime = ordered(cut(data$CRSArrTime, c(-1, 600, 1200, 1800, 2359)),
    labels = c("Overnight", "Morning", "Afternoon", "Evening"))
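How `cut` assigns values at and beyond the break points is easy to check on a few hand-picked numbers. The following sketch reuses the CRSDepTime breaks and labels from above on made-up HHMM values (`crs_dep`, `period` and `out_of_range` are hypothetical names). With the default `right = TRUE`, a value equal to a break falls into the lower bin, and a value outside the outermost breaks becomes NA.

```r
# Made-up scheduled departure times in HHMM form.
crs_dep <- c(5, 600, 601, 1200, 1545, 1800, 2359)

period <- cut(crs_dep,
              breaks = c(-1, 600, 1200, 1800, Inf),
              labels = c("Overnight", "Morning", "Afternoon", "Evening"),
              ordered_result = TRUE)

as.character(period)
# "Overnight" "Overnight" "Morning" "Morning" "Afternoon" "Afternoon" "Evening"

# Values at or below the lowest break are not covered and become NA.
out_of_range <- cut(c(-3, 0, 10), breaks = c(0, 25, 50, 80, Inf))
is.na(out_of_range)
# TRUE TRUE FALSE
```

This is why the breaks for the time variables start at -1 rather than 0, and why it helps to confirm that every observation lies inside the chosen range before binning.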

DayOfWeek contains numbers such as 1, 2, 3 to represent the days of the week. So we convert it to character and replace each number with the day name. After this manipulation, it is transformed into a factor.

data$DayOfWeek = as.character(data$DayOfWeek)

data$DayOfWeek = gsub("^1", "Sunday", data$DayOfWeek)

data$DayOfWeek = gsub("^2", "Monday", data$DayOfWeek)

data$DayOfWeek = gsub("^3", "Tuesday", data$DayOfWeek)

data$DayOfWeek = gsub("^4", "Wednesday", data$DayOfWeek)

data$DayOfWeek = gsub("^5", "Thursday", data$DayOfWeek)

data$DayOfWeek = gsub("^6", "Friday", data$DayOfWeek)

data$DayOfWeek = gsub("^7", "Saturday", data$DayOfWeek)

data$DayOfWeek = factor(data$DayOfWeek)
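The same recoding can also be written as a single `factor` call. This is only an alternative sketch on made-up codes: `day_codes`, `day_labels` and `day_names` are hypothetical names, and the label order is an assumption copied from the gsub calls above, so it should be checked against the dataset's documentation before use.

```r
# Made-up day codes in the range 1..7.
day_codes <- c(1, 7, 3, 3, 6)

# Assumption: label order copied from the gsub calls above (1 -> "Sunday", ...).
day_labels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# levels gives the codes, labels gives the names in the same order.
day_names <- factor(day_codes, levels = 1:7, labels = day_labels)

as.character(day_names)
# "Sunday" "Saturday" "Tuesday" "Tuesday" "Friday"
```

One side effect of this form is that the factor levels keep calendar order instead of the alphabetical order produced by `factor` on a character vector.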

Five variables, such as FlightNum and ActualElapsedTime, are not that useful, so they need to be removed.

logNdx = !(names(data) %in% c("DayofMonth", "FlightNum", "Cancelled",
    "ActualElapsedTime", "DepDelay", "UniqueCarrier"))

data.AR = data[, logNdx]

After the processing above, the dataset to analyze is called data.AR. It contains 10 variables, summarized below.

summary(data.AR)

## DayOfWeek CRSDepTime CRSArrTime TailNum

## Friday :20526 Overnight: 3009 Overnight: 2050 N986CA : 129

## Monday :27345 Morning :54579 Morning :35361 N87353 : 126

## Saturday :21397 Afternoon:72468 Afternoon:67079 N77302 : 122

## Sunday :30021 Evening :38591 Evening :64157 N507CA : 112

## Thursday :22742 N472CA : 107

## Tuesday :26882 N471CA : 106

## Wednesday:19734 (Other):167945

## AirTime ArrDelay Origin

## Too-Short :28126 On-Time :46402 ATL : 12232

## Short :64977 Delayed :53601 ORD : 11020

## Intermediate:56146 Intermediate-Delayed:29107 DEN : 7004

## Long :14165 Much-Delayed :39537 DFW : 6208

## Too-Long : 5233 DTW : 4984

## PHX : 4353

## (Other):122846

## Dest Distance

## ATL : 11791 Short :33685

## ORD : 9506 Medium :50906

## DEN : 6338 Long :43644

## DFW : 5159 Too long:40412

## LAX : 5013

## PHX : 4663

## (Other):126177

We apply association rule mining to the dataset. First, we intend to find the main factors that could make a flight on time or not. We set thresholds for support and confidence, and restrict the right-hand side of the rules to the four levels of ArrDelay: On-Time, Delayed, Intermediate-Delayed, and Much-Delayed.

apriori.appearance1 = list(rhs = c("ArrDelay=On-Time", "ArrDelay=Delayed",
    "ArrDelay=Intermediate-Delayed", "ArrDelay=Much-Delayed"), default = "lhs")

apriori.parameter1 = list(support = 0.01, confidence = 0.1)

rules1 = apriori(data.AR, parameter = apriori.parameter1, appearance = apriori.appearance1)

##

## parameter specification:

## confidence minval smax arem aval originalSupport support minlen maxlen

## 0.1 0.1 1 none FALSE TRUE 0.01 1 10

## target ext

## rules FALSE

##

## algorithmic control:

## filter tree heap memopt load sort verbose

## 0.1 TRUE TRUE FALSE TRUE 2 TRUE

##

## apriori - find association rules with the apriori algorithm

## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt

## set item appearances ...[4 item(s)] done [0.00s].

## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].

## sorting and recoding items ... [83 item(s)] done [0.01s].

## creating transaction tree ... done [0.09s].

## checking subsets of size 1 2 3 4 5 done [0.07s].

## writing ... [743 rule(s)] done [0.00s].

## creating S4 object ... done [0.02s].

We keep the rules whose lift is greater than 1.2 and create a subset sorted by lift.

rules1.subset = subset(rules1, subset = lift > 1.2 & confidence > 0.1)

rules1.subset.conf = sort(rules1.subset, by = "lift")

summary(rules1.subset)

## set of 50 rules

##

## rule length distribution (lhs + rhs):sizes

## 2 3 4 5

## 4 26 18 2

##


## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 2.00 3.00 3.00 3.36 4.00 5.00

##

## summary of quality measures:

## support confidence lift

## Min. :0.0101 Min. :0.284 Min. :1.20

## 1st Qu.:0.0118 1st Qu.:0.314 1st Qu.:1.24

## Median :0.0147 Median :0.342 Median :1.27

## Mean :0.0182 Mean :0.337 Mean :1.31

## 3rd Qu.:0.0200 3rd Qu.:0.354 3rd Qu.:1.32

## Max. :0.0725 Max. :0.428 Max. :1.83

##

## mining info:

## data ntransactions support confidence

## data.AR 168647 0.01 0.1

The list below displays the top five rules sorted by lift.

inspect(rules1.subset.conf[1:5])

## lhs rhs support confidence lift

## 1 {Dest=EWR} => {ArrDelay=Much-Delayed} 0.01101 0.4284 1.827

## 2 {CRSDepTime=Afternoon,

## Dest=ORD} => {ArrDelay=Much-Delayed} 0.01040 0.4048 1.727

## 3 {CRSArrTime=Evening,

## Origin=ORD} => {ArrDelay=Much-Delayed} 0.01032 0.3786 1.615

## 4 {Dest=ORD} => {ArrDelay=Much-Delayed} 0.02001 0.3549 1.514

## 5 {Origin=ORD} => {ArrDelay=Much-Delayed} 0.02199 0.3365 1.435

Flights that depart from or land at ORD are delayed with high probability. We searched the weather history records of Chicago O’Hare International Airport (ORD), and it did suffer a very severe snowstorm at that time. So even though the weather condition is not included in the flight information, it is still a key factor in on-time performance.

rules1.subset.delay = subset(rules1.subset, subset = lhs %in% "DayOfWeek=Monday")

plot(rules1.subset.delay, method = "graph", control = list(type = "items"))

[Figure: graph for 5 rules. Items: DayOfWeek=Monday, CRSDepTime=Morning, CRSDepTime=Evening, CRSArrTime=Morning, CRSArrTime=Evening, ArrDelay=On-Time, ArrDelay=Much-Delayed. Node size: support (0.011 to 0.019); color: lift (1.213 to 1.421).]

apriori.appearance3 = list(rhs = c("ArrDelay=On-Time"), default = "lhs")

apriori.parameter3 = list(support = 0.01, confidence = 0.1)

rules3 = apriori(data.AR, parameter = apriori.parameter3, appearance = apriori.appearance3)

##

## parameter specification:

## confidence minval smax arem aval originalSupport support minlen maxlen

## 0.1 0.1 1 none FALSE TRUE 0.01 1 10

## target ext

## rules FALSE

##

## algorithmic control:

## filter tree heap memopt load sort verbose

## 0.1 TRUE TRUE FALSE TRUE 2 TRUE

##

## apriori - find association rules with the apriori algorithm

## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt

## set item appearances ...[1 item(s)] done [0.00s].

## set transactions ...[5118 item(s), 168647 transaction(s)] done [0.08s].

## sorting and recoding items ... [83 item(s)] done [0.01s].

## creating transaction tree ... done [0.09s].

## checking subsets of size 1 2 3 4 5 done [0.06s].

## writing ... [211 rule(s)] done [0.00s].

## creating S4 object ... done [0.02s].

rules3.subset = subset(rules3, subset = lift > 1.2 & confidence > 0.1)

rules3.subset.conf = sort(rules3.subset, by = "lift")

rules3.subset.ontime = subset(rules3.subset.conf, subset = lhs %in% c("DayOfWeek=Friday",
    "DayOfWeek=Saturday", "DayOfWeek=Sunday", "DayOfWeek=Monday", "DayOfWeek=Tuesday",
    "DayOfWeek=Wednesday", "DayOfWeek=Thursday"))

inspect(rules3.subset.ontime[1:5])


## lhs rhs support confidence lift

## 1 {DayOfWeek=Monday,

## CRSArrTime=Morning} => {ArrDelay=On-Time} 0.01326 0.3909 1.421


## CRSDepTime=Morning,



## CRSDepTime=Morning} => {ArrDelay=On-Time} 0.01874 0.3624 1.317

## 4 {DayOfWeek=Wednesday,

## CRSDepTime=Morning} => {ArrDelay=On-Time} 0.01283 0.3503 1.273

## 5 {DayOfWeek=Sunday,


So we could conclude that the flights with the highest on-time performance are always in the morning. This suggests that in December 2008 air traffic was in good condition in the morning but comparatively heavier in the afternoon and evening.

rules1.subset.delay1 = subset(rules1.subset.conf, subset = lhs %in% c("Distance=Short",
    "Distance=Medium", "Distance=Long", "Distance=Too long"))

inspect(rules1.subset.delay1[1:10])


## lhs rhs support confidence lift

## 1 {CRSArrTime=Morning,

## AirTime=Short,

## Distance=Medium} => {ArrDelay=On-Time} 0.02186 0.3679 1.337


## AirTime=Intermediate,

## Distance=Long} => {ArrDelay=On-Time} 0.01494 0.3610 1.312

## 3 {CRSDepTime=Morning,

## CRSArrTime=Morning,

## AirTime=Short,








## AirTime=Intermediate,


## 7 {AirTime=Short,

## Origin=ATL,










## Distance=Too long} => {ArrDelay=On-Time} 0.01120 0.3480 1.265

We find that the flights most likely to be on time often have long-distance routes. This suggests that in the air traffic control system, small regions and short routes are much busier: people may have 5 or more flights to choose from when going from Boston to New York City, while there are only 2 flights if your family wants to travel from Washington D.C. to beautiful San Diego. That is why long-distance routes put less pressure on the air traffic control system, and it is easier to run into air traffic jams on shorter routes.

6 Cluster Analysis

We are going to study the airline dataset using cluster analysis. Cluster analysis generally refers to sorting observed data into k groups (k indicates how many groups will be created) so as to maximize the similarity of observations within the same group and minimize the similarity of observations across different groups. Basically, cluster analysis can be separated into two approaches, hierarchical and non-hierarchical. We are going to run non-hierarchical clustering, including K-Means and Pam. For the airline dataset, we will continue finding associations within the dataset and the factors that influence ArrDelay.

6.1 Setup

First we need to load libraries we are going to use.

library(cluster)  # pam and silhouette functions

library(proxy)    # distance and similarity measures

library(fpc)      # cluster.stats function

library(pamr)     # loaded in the original analysis; pam itself comes from cluster

library(clValid)  # clValid function

library(ggplot2)  # plot diagram

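Because the original dataset is too large, it is very difficult to compute the distance matrix. So we randomly choose 1000 records and do the analysis on this sample.

# cd is not defined in the listing; the next line is an assumed reconstruction
# that keeps the 12 integer columns reported by str(m) below
cd = d[, !(names(d) %in% c("UniqueCarrier", "FlightNum", "TailNum", "Origin", "Dest"))]

cd = cd[sample(nrow(cd), 1000), ]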

m = as.matrix(cd)

str(m)

## int [1:1000, 1:12] 31 10 28 18 14 23 7 20 11 20 ...

## - attr(*, "dimnames")=List of 2

## ..$ : chr [1:1000] "6926813" "6946821" "6545814" "6680323" ...

## ..$ : chr [1:12] "DayofMonth" "DayOfWeek" "DepTime" "CRSDepTime" ...

mDist = dist(m)
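`dist` computes Euclidean distances by default, so columns measured on large numeric ranges (for example clock times in HHMM form, which run up to 2400) can weigh more than columns in minutes. The sketch below illustrates this on three made-up flights; `toy`, `raw_dist` and `scaled_dist` are hypothetical names, and whether to standardize with `scale` is a modelling choice that the analysis above does not make.

```r
# Three made-up flights: a and b differ only in departure time,
# a and c differ mainly in arrival delay.
toy <- rbind(
  a = c(DepTime = 600,  ArrDelay = 20),
  b = c(DepTime = 1800, ArrDelay = 20),
  c = c(DepTime = 610,  ArrDelay = 300)
)

raw_dist    <- as.matrix(dist(toy))         # distances on the original units
scaled_dist <- as.matrix(dist(scale(toy)))  # distances after standardizing columns

raw_dist["a", "b"]     # 1200: driven entirely by DepTime
raw_dist["a", "c"]     # about 280, although the delay differs by 280 minutes
scaled_dist["a", "b"]  # after scaling, the two distances are nearly equal
scaled_dist["a", "c"]
```

On the raw values, a clustering would tend to separate flights by time of day first; after scaling, both differences count about equally.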

6.2 Determine K

The most important thing in cluster analysis is to determine the best K. To find it, we apply the clValid function, which directly gives us the optimal result.

# clValid

hvalid <- clValid(m, 2:10, clMethods = c("hierarchical"), validation = "internal",
    maxitems = 1e+06)

pamvalid <- clValid(m, 2:10, clMethods = c("pam"), validation = "internal",
    maxitems = 1e+06)

kvalid <- clValid(m, 2:10, clMethods = c("kmeans"), validation = "internal",
    maxitems = 1e+06)

Now we can use the summary() function to see the result of each method, where the Optimal Scores section directly gives us the best number of clusters.
summary(kvalid)

##

## Clustering Methods:

## kmeans

##

## Cluster sizes:

## 2 3 4 5 6 7 8 9 10

##

## Validation Measures:

## 2 3 4 5 6 7 8 9 10

##

## kmeans Connectivity 2.878 79.275 85.293 86.702 104.193 131.675 105.027 128.007 136.169

## Dunn 0.098 0.012 0.012 0.014 0.027 0.023 0.027 0.023 0.025

## Silhouette 0.471 0.454 0.468 0.470 0.485 0.388 0.486 0.390 0.391

##

## Optimal Scores:

##

## Score Method Clusters

## Connectivity 2.878 kmeans 2

## Dunn 0.098 kmeans 2

## Silhouette 0.486 kmeans 8

summary(pamvalid)

##


## pam

##

## Cluster sizes:

## 2 3 4 5 6 7 8 9 10

##


## 2 3 4 5 6 7 8 9 10

##

## pam Connectivity 91.199 139.967 145.223 171.271 173.816 237.760 221.203 212.537 246.705

## Dunn 0.019 0.013 0.014 0.018 0.009 0.012 0.013 0.013 0.013

## Silhouette 0.404 0.288 0.318 0.360 0.305 0.298 0.320 0.320 0.333

##

## Optimal Scores:

##


## Connectivity 91.199 pam 2

## Dunn 0.019 pam 2

## Silhouette 0.404 pam 2

summary(hvalid)

##


## hierarchical

##

## Cluster sizes:

## 2 3 4 5 6 7 8 9 10

##


## 2 3 4 5 6 7 8 9 10

##

## hierarchical Connectivity 5.287 5.287 7.005 11.240 32.673 33.506 35.006 38.148 44.491

## Dunn 0.420 0.420 0.417 0.381 0.073 0.073 0.073 0.074 0.074

## Silhouette 0.485 0.460 0.443 0.426 0.407 0.374 0.368 0.317 0.318

##

## Optimal Scores:

##


## Connectivity 5.287 hierarchical 2

## Dunn 0.420 hierarchical 2

## Silhouette 0.485 hierarchical 2

Because the 1000 sample records are randomly chosen, the results are not always the same. But we still find that in most cases K = 2 is given. We can also use other measurements to validate our result. We use the foreachcluster3 function to show 6 measurements for cluster numbers from 2 to 10.

foreachcluster3 = function(k) {
  # cluster with pam, then collect validation statistics for this k
  pamC = pam(x = m, k)
  p.stats = cluster.stats(mDist, pamC$clustering)
  c(max.dia = p.stats$max.diameter, min.sep = p.stats$min.separation,
    avg.wi = p.stats$average.within, avg.bw = p.stats$average.between,
    silwidth = p.stats$avg.silwidth, dunn = p.stats$dunn)
}

We apply this function to cluster numbers from 2 to 10 and use rbind to make a table.

t3 = do.call(rbind, lapply(2:10, foreachcluster3))

rownames(t3) = 2:10

t3

## max.dia min.sep avg.wi avg.bw silwidth dunn

## 2 3899 75.14 1062.9 1811 0.4041 0.019271

## 3 3899 52.03 963.4 1666 0.2884 0.013344

## 4 3802 52.03 797.9 1698 0.3184 0.013685

## 5 3366 59.92 653.1 1689 0.3599 0.017800

## 6 3366 30.82 605.6 1625 0.3055 0.009157

## 7 3265 39.76 579.1 1602 0.2979 0.012177

## 8 2959 39.76 552.8 1604 0.3199 0.013437

## 9 2959 38.39 521.4 1583 0.3196 0.012974

## 10 2959 39.76 480.1 1562 0.3330 0.013437

The result also shows we should use K = 2 for cluster analysis.
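The silhouette criterion used above can be reproduced by hand on synthetic data to see how it picks K. This is a sketch under stated assumptions: the data are two artificial, well-separated groups rather than the flight sample, `x`, `avg_sil` and `best_k` are hypothetical names, and `kmeans` is used instead of pam so that only `cluster::silhouette` is needed from outside base R.

```r
library(cluster)  # silhouette()

set.seed(42)
# Two synthetic, well-separated groups of 50 points in two dimensions.
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
x_dist <- dist(x)

# Average silhouette width for each candidate number of clusters.
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, x_dist)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the largest average silhouette width is the suggested choice.
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

On real data the maximum is usually less clear-cut than on these synthetic groups, which is one reason to compare several measures as done in the table above.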

6.3 Cluster Analysis

After getting the best K = 2, we can continue to compare different clusterings to see if we can get interesting results.

From the previous tests, we find that hclust does not perform well in this analysis: one cluster has only a few elements while the other has over 99% of the elements. So we will use the Pam and Kmeans functions.

6.3.1 Pam

We apply pam function to the matrix, and set k = 2.

pamC = pam(x = m, 2)

pamC$clusinfo

## size max_diss av_diss diameter separation

## [1,] 561 2814 741.6 3505 75.14

## [2,] 439 3004 717.2 3899 75.14

pamcluster = data.frame(pamC$clustering)

We paste the cluster result back to our original dataset.

total = cbind(cd, pamcluster)

After that, we can obtain the two subsets according to their cluster numbers.

d1 = subset(total, pamC.clustering == 1)

d2 = subset(total, pamC.clustering == 2)

summary(d1)

## DayofMonth DayOfWeek DepTime CRSDepTime

## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045

## 1st Qu.:10.0 1st Qu.:2.00 1st Qu.:1631 1st Qu.:1525

## Median :18.0 Median :3.00 Median :1809 Median :1715

## Mean :16.9 Mean :3.59 Mean :1802 Mean :1699

## 3rd Qu.:23.0 3rd Qu.:5.00 3rd Qu.:2004 3rd Qu.:1855

## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308

## ArrTime CRSArrTime ActualElapsedTime CRSElapsedTime

## Min. : 2 Min. : 640 Min. : 33 Min. : 35

## 1st Qu.:1745 1st Qu.:1718 1st Qu.: 84 1st Qu.: 80

## Median :1941 Median :1914 Median :123 Median :115

## Mean :1826 Mean :1904 Mean :137 Mean :130

## 3rd Qu.:2136 3rd Qu.:2105 3rd Qu.:168 3rd Qu.:160

## Max. :2357 Max. :2359 Max. :441 Max. :407

## AirTime ArrDelay DepDelay Distance

## Min. : 14 Min. : 15.0 Min. :-10.0 Min. : 56

## 1st Qu.: 57 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334

## Median : 88 Median : 47.0 Median : 45.0 Median : 590

## Mean :108 Mean : 69.3 Mean : 62.4 Mean : 720

## 3rd Qu.:137 3rd Qu.: 97.0 3rd Qu.: 89.0 3rd Qu.: 948

## Max. :382 Max. :395.0 Max. :377.0 Max. :2640

## pamC.clustering

## Min. :1

## 1st Qu.:1

## Median :1

## Mean :1

## 3rd Qu.:1

## Max. :1

summary(d2)


## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45

## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 835 1st Qu.: 810

## Median :18.0 Median :3.00 Median :1021 Median : 955



## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359


## Min. : 10 Min. : 1 Min. : 35 Min. : 34.0

## 1st Qu.:1022 1st Qu.: 942 1st Qu.: 92 1st Qu.: 83.5

## Median :1217 Median :1130 Median :129 Median :116.0

## Mean :1178 Mean :1111 Mean :146 Mean :134.6

## 3rd Qu.:1408 3rd Qu.:1322 3rd Qu.:176 3rd Qu.:165.0

## Max. :1810 Max. :2345 Max. :432 Max. :405.0


## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74





## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777

## pamC.clustering

## Min. :2

## 1st Qu.:2

## Median :2

## Mean :2

## 3rd Qu.:2

## Max. :2

We can see from the summary that all the columns are similar except the departure time and our target variable, the arrival delay. We can conclude that when the departure time is around midnight or in the morning, the flight is more likely to have a relatively lower delay, which matches the conclusion we drew from association rules.

totaldf = data.frame(total)

totaldf$pamC.clustering = as.factor(totaldf$pamC.clustering)

qplot(data = totaldf, x = totaldf$pamC.clustering, y = totaldf$DepTime,
    colour = totaldf$pamC.clustering, geom = "boxplot")

[Figure: boxplot of DepTime (y axis, 0 to 2000) for each PAM cluster (x axis and colour legend: 1, 2).]

From the result, we know that pam has done quite a good job in clustering. Next, we are going to try Kmeans to compare the results.

6.3.2 Kmeans

We apply similar steps with Kmeans to see if it works better than the pam function.

kmeans.results = kmeans(m, 2)

clusterdf = data.frame(kmeans.results$cluster)

total = cbind(cd, clusterdf)

d1 = subset(total, kmeans.results.cluster == 1)

summary(d1)


## Min. : 1.0 Min. :1.00 Min. :1052 Min. :1045

## 1st Qu.:10.5 1st Qu.:2.00 1st Qu.:1627 1st Qu.:1520

## Median :18.0 Median :3.00 Median :1804 Median :1710



## Max. :31.0 Max. :7.00 Max. :2351 Max. :2308


## Min. : 2 Min. : 640 Min. : 33 Min. : 35

## 1st Qu.:1739 1st Qu.:1714 1st Qu.: 84 1st Qu.: 80




## Max. :2357 Max. :2359 Max. :441 Max. :407


## Min. : 14.0 Min. : 15.0 Min. :-10.0 Min. : 56

## 1st Qu.: 57.5 1st Qu.: 27.0 1st Qu.: 23.0 1st Qu.: 334

## Median : 88.0 Median : 47.0 Median : 45.0 Median : 588

## Mean :107.6 Mean : 69.8 Mean : 62.8 Mean : 720

## 3rd Qu.:136.5 3rd Qu.: 97.0 3rd Qu.: 88.5 3rd Qu.: 947

## Max. :382.0 Max. :425.0 Max. :392.0 Max. :2640

## kmeans.results.cluster

## Min. :1

## 1st Qu.:1

## Median :1

## Mean :1

## 3rd Qu.:1

## Max. :1

d2 = subset(total, kmeans.results.cluster == 2)

summary(d2)


## Min. : 1.0 Min. :1.00 Min. : 14 Min. : 45

## 1st Qu.:12.0 1st Qu.:2.00 1st Qu.: 834 1st Qu.: 805

## Median :18.0 Median :3.00 Median :1017 Median : 950

## Mean :17.1 Mean :3.54 Mean :1030 Mean : 992


## Max. :31.0 Max. :7.00 Max. :2333 Max. :2359


## Min. : 24 Min. : 1 Min. : 35 Min. : 34

## 1st Qu.:1021 1st Qu.: 940 1st Qu.: 92 1st Qu.: 83




## Max. :1810 Max. :2305 Max. :432 Max. :405


## Min. : 20 Min. : 15.0 Min. : -17.0 Min. : 74







## Max. :389 Max. :1015.0 Max. :1019.0 Max. :2777

## kmeans.results.cluster

## Min. :2

## 1st Qu.:2

## Median :2

## Mean :2

## 3rd Qu.:2

## Max. :2

The two methods produce very similar results. They also show a strong relationship between departure time and arrival delay, which matches our findings from the association rules and the decision trees.
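The per-cluster summaries above can also be put side by side in one table. The following is a minimal sketch on synthetic data, not the flight data: the data frame flights, the column values and the choice to standardize with scale() are assumptions made only for illustration.

```r
# Hypothetical stand-in for the flight data: two numeric columns.
set.seed(42)
flights = data.frame(
  DepTime  = c(rnorm(50, mean = 900, sd = 100), rnorm(50, mean = 1800, sd = 100)),
  ArrDelay = c(rnorm(50, mean = 45, sd = 10), rnorm(50, mean = 65, sd = 10))
)

# Cluster on standardized columns so that no single variable dominates the distance.
km = kmeans(scale(flights), centers = 2, nstart = 10)
flights$cluster = factor(km$cluster)

# One row per cluster and one column per variable: easier to compare than
# calling summary() on each subset separately.
cluster.means = aggregate(cbind(DepTime, ArrDelay) ~ cluster, data = flights, FUN = mean)
print(cluster.means)

# Cluster sizes.
print(table(flights$cluster))
```

The same aggregate() call could be pointed at the cluster labels returned by pam, which would make the two methods directly comparable column by column.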

7 Decision Tree

In this section, we use decision trees to analyze the factors that affect the target variables. First, we load the required libraries.

library(rpart)

library(rpart.plot)

library(rattle)

library(maptree)

library(party)

library(partykit)

7.1 Categorize Variable

We convert several variables into categories.

Distance We divide the distance into three ranges: up to 750, 750 to 1000, and greater than 1000.

d$Distance = ordered(cut(d$Distance, c(0, 750, 1000, Inf)), labels = c("upto750",

"750to1000", ">1000"))

DayOfWeek We replace the weekday numbers with names (1 = MON, 2 = TUE, etc.) with the help of gsub.

d$DayOfWeek = gsub("1", "MON", d$DayOfWeek)

d$DayOfWeek = gsub("2", "TUE", d$DayOfWeek)



d$DayOfWeek = gsub("3", "WED", d$DayOfWeek)

d$DayOfWeek = gsub("4", "THU", d$DayOfWeek)

d$DayOfWeek = gsub("5", "FRI", d$DayOfWeek)

d$DayOfWeek = gsub("6", "SAT", d$DayOfWeek)

d$DayOfWeek = gsub("7", "SUN", d$DayOfWeek)
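As a sketch of an alternative to the seven gsub calls, factor() can map all codes to labels in one step and keeps the days in calendar order. The vector day.codes below is a hypothetical sample, not taken from the dataset.

```r
# Hypothetical sample of DayOfWeek codes (1 = Monday ... 7 = Sunday).
day.codes = c(1, 5, 7, 3, 6, 2, 4, 1)

day.names = c("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")

# factor() maps each level to its label in one step and keeps calendar order.
day.of.week = factor(day.codes, levels = 1:7, labels = day.names)

print(day.of.week)
print(table(day.of.week))
```

Unlike the character vector produced by gsub, the result is already a factor, which is the form the tree functions work with.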

Origin We group the origin airports into five regions (SW, SE, NE, MW and W) with the help of the gsub function.

d$Origin = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS|GSO|CVG|MEM|LEX|AVL|GSP|CAE|MFE|FLG|YUM|TEX|GRK|LAW|GGG|ABI|TYR|ROW|SJT|PHX|LRD",

"SW", d$Origin)

d$Origin = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT|DCA|SAV|XNA|GPT|VPS|MOB|PNS|CRW|HSV|ILM|ROA|FAY|SRQ|OAJ|HTS|CHA|TRI|IDA|TLH|SJU|STT|MYR|STX|DAB|AGS|MLB|PFN|EYW|PHF|CSG|DHN|BQK|VLD|EWN|ABY|MEI|GNV|GTR|FSM|CHO|FLO|LYH|MGM|TXK|PSG|BQN",

"SE", d$Origin)

d$Origin = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT|COD|AVP|AVP|AVP|SWF|ERI|ELM|SCE|BGM|ITH|SCC|CPR|HPN",

"NE", d$Origin)

d$Origin = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID|CWA|FSD|GRB|LAN|MBS|MLI|SBN|SPI|TVC|RAP|FWA|FNT|DLH|AZO|LNK|SGF|BIS|FAR|PIA|EVV|BMI|CLL|ACT|CMI|RST|MQT|LSE|DBQ|TOL|GFK|MOT|ALO|CMX|RHI|PIR|LWS|PSC",

"MW", d$Origin)

d$Origin = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT|BTR|COS|LCH|SHV|AEX|MLU|BFL|FAT|GJT|GUC|HNL|ITO|KOA|LGB|LIH|MRY|OGG|SBA|PSP|EGE|GCC|JAC|MTJ|ASE|DRO|SBP|RKS|BIL|HLN|BZN|MSO|FCA|GTF|BTM|HDN|ACV|EUG|MFR|RDM|PMD|IYK|OXR|SMX|RDD|LMT|CEC|SGU|MOD|OTH|CIC|CLD|IPL|CDC|EKO|PIH|SUN|TWF|ANC|BET|JNU|OTZ|OME|KTN|FAI|BRW|SIT|ADQ|WRG|CDV|YAK|ADK",

"W", d$Origin)

Dest We group the destination airports into the same five regions (SW, SE, NE, MW and W) with the help of the gsub function.

d$Dest = gsub("ABQ|AMA|AUS|CRP|DAL|ELP|HOU|HRL|LBB|OKC|SAT|TUS|TUL|MAF|IAH|DFW|BRO|CHS|TYS|GSO|CVG|MEM|LEX|AVL|GSP|CAE|MFE|FLG|YUM|TEX|GRK|LAW|GGG|ABI|TYR|ROW|SJT|PHX|LRD",

"SW", d$Dest)

d$Dest = gsub("BHM|BNA|BWI|FLL|IAD|JAN|JAX|LIT|MCO|RDU|TPA|ORF|PBI|SDF|RSW|ATL|RIC|MIA|CLT|DCA|SAV|XNA|GPT|VPS|MOB|PNS|CRW|HSV|ILM|ROA|FAY|SRQ|OAJ|HTS|CHA|TRI|IDA|TLH|SJU|STT|MYR|STX|DAB|AGS|MLB|PFN|EYW|PHF|CSG|DHN|BQK|VLD|EWN|ABY|MEI|GNV|GTR|FSM|CHO|FLO|LYH|MGM|TXK|PSG|BQN",

"SE", d$Dest)

d$Dest = gsub("BDL|BUF|ISP|MHT|PHL|PIT|PVD|ALB|ROC|EWR|BTV|BGR|SYR|BOS|ABE|PWM|LGA|JFK|MDT|COD|AVP|AVP|AVP|SWF|ERI|ELM|SCE|BGM|ITH|SCC|CPR|HPN",

"NE", d$Dest)

d$Dest = gsub("CLE|CMH|DTW|IND|MCI|MDW|STL|OMA|MKE|DAY|DSM|GRR|ORD|MSP|MSN|ICT|ATW|CAK|CID|CWA|FSD|GRB|LAN|MBS|MLI|SBN|SPI|TVC|RAP|FWA|FNT|DLH|AZO|LNK|SGF|BIS|FAR|PIA|EVV|BMI|CLL|ACT|CMI|RST|MQT|LSE|DBQ|TOL|GFK|MOT|ALO|CMX|RHI|PIR|LWS|PSC",

"MW", d$Dest)

d$Dest = gsub("BOI|BUR|DEN|LAS|LAX|MSY|OAK|ONT|PDX|RNO|SAN|SEA|SFO|SJC|SLC|SMF|SNA|GEG|LFT|BTR|COS|LCH|SHV|AEX|MLU|BFL|FAT|GJT|GUC|HNL|ITO|KOA|LGB|LIH|MRY|OGG|SBA|PSP|EGE|GCC|JAC|MTJ|ASE|DRO|SBP|RKS|BIL|HLN|BZN|MSO|FCA|GTF|BTM|HDN|ACV|EUG|MFR|RDM|PMD|IYK|OXR|SMX|RDD|LMT|CEC|SGU|MOD|OTH|CIC|CLD|IPL|CDC|EKO|PIH|SUN|TWF|ANC|BET|JNU|OTZ|OME|KTN|FAI|BRW|SIT|ADQ|WRG|CDV|YAK|ADK",

"W", d$Dest)

DayOfMonth We divide the December days of the month into regular days (1 to 23) and Christmas week (24 to 31).

d$DayofMonth = ordered(cut(d$DayofMonth, c(0, 23, 32)), labels = c("R.Days",

"CH.Days"))

DepDelay We divide departure delay into two levels: low (up to 60 minutes) and high (more than 60 minutes).



d$DepDelay = ordered(cut(d$DepDelay, c(-Inf, 60, Inf)), labels = c("low", "high"))
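The categorization steps in this subsection can be sketched more compactly. The example below is only an illustration under stated assumptions: d.demo is a hypothetical six-row data frame, and region.of maps just a handful of the airport codes listed above. It uses a named lookup vector in place of the chained gsub calls and passes the labels to cut() directly.

```r
# Hypothetical mini data set; only a handful of airport codes are mapped here.
d.demo = data.frame(
  Origin   = c("DAL", "ATL", "BOS", "ORD", "LAX", "HOU"),
  Distance = c(240, 760, 1100, 1000, 2475, 750),
  DepDelay = c(5, 75, 60, -3, 120, 61)
)

# Named vector: airport code -> region. Indexing by name replaces chained gsub calls.
region.of = c(DAL = "SW", HOU = "SW", ATL = "SE", BOS = "NE", ORD = "MW", LAX = "W")
d.demo$Origin = factor(unname(region.of[as.character(d.demo$Origin)]),
                       levels = c("SW", "SE", "NE", "MW", "W"))

# cut() accepts labels directly; intervals are right-closed, so 750 falls in "upto750".
d.demo$Distance = cut(d.demo$Distance, breaks = c(0, 750, 1000, Inf),
                      labels = c("upto750", "750to1000", ">1000"), ordered_result = TRUE)
d.demo$DepDelay = cut(d.demo$DepDelay, breaks = c(-Inf, 60, Inf),
                      labels = c("low", "high"), ordered_result = TRUE)

print(d.demo)
```

A lookup vector returns NA for any code that is missing from the map, which makes unmapped airports easy to spot, whereas gsub leaves them unchanged.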

7.2 Rpart

Rpart performs recursive partitioning for classification, regression and survival trees. We use rpart to model two response variables, ArrDelay and DepDelay.

Departure Delay: DepDelay is the response variable, and DayofMonth, DayOfWeek, DepTime and Distance are the predictor variables.

ss.formula = DepDelay ~ DayofMonth + DayOfWeek + DepTime + Distance

# formula for tree

ss.rpart = rpart(data = d, formula = ss.formula)

draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw

[Figure: rpart classification tree for DepDelay (168647 obs). The root splits on DepTime at 1406.5; the left branch splits at 447.5 into leaves 1 (low, 70766 obs) and 2 (high, 1624 obs); the right branch splits at 2229.5 into leaves 3 (low, 91550 obs) and 4 (high, 4707 obs). Total classified correct = 27.5 %]


print(ss.rpart) # for printing tree rules

## n= 168647

##

## node), split, n, loss, yval, (yprob)

## * denotes terminal node

##

## 1) root 168647 50820 low (0.6987 0.3013)

## 2) DepTime< 1406 72390 14340 low (0.8019 0.1981)

## 4) DepTime>=447.5 70766 13030 low (0.8158 0.1842) *

## 5) DepTime< 447.5 1624 314 high (0.1933 0.8067) *

## 3) DepTime>=1406 96257 36480 low (0.6210 0.3790)

## 6) DepTime< 2230 91550 33200 low (0.6374 0.3626) *

## 7) DepTime>=2230 4707 1427 high (0.3032 0.6968) *

From this tree we conclude that flights departing at night, roughly between 10:30 PM and 5:00 AM (DepTime >= 2230 or DepTime < 447.5), are more likely to have a high departure delay than daytime flights. The tree shown splits only on DepTime, so the time of day matters far more than the other predictors. We therefore leave AirTime out of the next decision tree.
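To check how well such a tree classifies, one option is to fit it on part of the data and compare predictions with the held-out rows. The sketch below does this on synthetic data in which night departures are made more likely to be delayed; the data frame sim, the probabilities 0.8 and 0.2 and the 70/30 split are all assumptions for illustration, not values from the flight data.

```r
library(rpart)  # recommended package shipped with standard R installations

# Synthetic data (an assumption, not the flight data): high delays are more
# likely for departures before 04:50 or from 22:30 onwards.
set.seed(1)
n = 2000
DepTime = sample(0:2359, n, replace = TRUE)
night = DepTime < 450 | DepTime >= 2230
DepDelay = factor(ifelse(runif(n) < ifelse(night, 0.8, 0.2), "high", "low"),
                  levels = c("low", "high"))
sim = data.frame(DepTime, DepDelay)

# Hold out 30% of the rows for testing.
train.idx = sample(n, size = 0.7 * n)
fit = rpart(DepDelay ~ DepTime, data = sim[train.idx, ], method = "class")

# Predicted classes for the held-out rows and the resulting confusion table.
pred = predict(fit, newdata = sim[-train.idx, ], type = "class")
conf = table(Prediction = pred, Reference = sim$DepDelay[-train.idx])
print(conf)

accuracy = sum(diag(conf)) / sum(conf)
print(accuracy)
```

Evaluating on held-out rows gives a less optimistic figure than scoring the tree on the data it was fitted to.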

Arrival Delay: ArrDelay is the response variable, and DayofMonth, DayOfWeek, Origin and Distance are the predictor variables.

ss.formula = ArrDelay ~ DayofMonth + DayOfWeek + Distance + Origin

# formula for tree

R.control = rpart.control(cp = 0.001) # to control tree

ss.rpart = rpart(data = d, formula = ss.formula, control = R.control)

draw.tree(ss.rpart, nodeinfo = TRUE) # for actual tree draw

[Figure: rpart regression tree for ArrDelay with seven terminal nodes. The root (mean 62.55, 168647 obs) splits on Origin (SE, SW, W against MW, NE); the lower splits use DayOfWeek, Origin and DayofMonth. Total deviance explained = 1.5 %]

print(ss.rpart) # for printing tree rules

## n= 168647

##

## node), split, n, deviance, yval

## * denotes terminal node

##

## 1) root 168647 675500000 62.55

## 2) Origin=SE,SW,W 110460 394400000 59.06

## 4) DayOfWeek=MON,SUN,THU,TUE,WED 81657 273300000 57.44

## 8) Origin=SE,SW 50291 142800000 54.80 *

## 9) Origin=W 31366 129600000 61.67 *

## 5) DayOfWeek=FRI,SAT 28803 120300000 63.64

## 10) DayofMonth=R.Days 17922 68280000 58.88 *

## 11) DayofMonth=CH.Days 10881 50950000 71.48 *

## 3) Origin=MW,NE 58187 277100000 69.20

## 6) DayOfWeek=MON,SUN,THU,WED 33946 124100000 63.62



## 12) DayOfWeek=THU 5849 17040000 52.35 *

## 13) DayOfWeek=MON,SUN,WED 28097 106100000 65.97 *

## 7) DayOfWeek=FRI,SAT,TUE 24241 150500000 77.01 *

From this decision tree we can say that the dataset is first divided by origin into SE, SW, W (mean arrival delay 59.06) and MW, NE (mean arrival delay 69.20). Within the first group, the day of week is then divided into MON, SUN, THU, TUE, WED and FRI, SAT.

7.3 Ctree

Ctree fits conditional inference trees, which embed tree-structured regression models into a well-defined theory of conditional inference procedures.

Departure Delay:

ss.formula1 = DepDelay ~ Distance + DepTime # formula for Ctree

ss.control = ctree_control(maxdepth = 2) #height is 2

ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation

## Loading required package: Formula

## Warning: there is no package called ’Formula’

plot(ss.ctree) # plotting of tree

[Figure: conditional inference tree for DepDelay. Node 1 splits on DepTime at 1406 (p < 0.001); node 2 splits on DepTime at 447 into Node 3 (n = 1624) and Node 4 (n = 70766); node 5 splits on DepTime at 2229 into Node 6 (n = 91550) and Node 7 (n = 4707). Each terminal node shows the proportions of low and high delays on a scale from 0 to 1.]

As explained for rpart, ctree gives the same result: delays at night, between 10:30 PM and 5:00 AM, are higher than during the day.

Arrival Delay:

ss.formula1 = ArrDelay ~ Distance + ArrTime # formula for Ctree

ss.control = ctree_control(maxdepth = 2) # height is 2

ss.ctree = ctree(data = d, formula = ss.formula1, control = ss.control) # tree creation

## Loading required package: Formula

## Warning: there is no package called ’Formula’

plot(ss.ctree) # plotting of tree

[Figure: conditional inference tree for ArrDelay. Node 1 splits on ArrTime at 518 (p < 0.001); node 2 splits on ArrTime at 134 into Node 3 (n = 7693) and Node 4 (n = 2165); node 5 splits on ArrTime at 1438 into Node 6 (n = 52990) and Node 7 (n = 105799). Each terminal node shows the distribution of ArrDelay on a scale from 0 to 1500.]

From this tree we conclude that flights arriving around midnight and in the early morning, before 5:18 AM (ArrTime ≤ 518), have higher delays than flights arriving during the day.

8 Random Forest

Now we use a random forest to learn more about prediction. The following libraries are used.

library(randomForest) # for randomForest

library(rpart)

library(caret) # for confusionMatrix

Because the original data is too large, we again randomly select 1000 rows.



rfd = rd[sample(nrow(rd), 1000), ]

We separate our dataset into a training set and a test set.

ndxTrain = sample(x = nrow(rfd), size = 0.7 * nrow(rfd))

rfd.train = rfd[ndxTrain, ]

rfd.test = rfd[-ndxTrain, ]
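Because sample() draws different rows on every run, results change from run to run unless the seed is fixed. A minimal sketch of a repeatable 70/30 split follows; the data frame demo, its columns and the seed value are hypothetical and stand in for the 1000-row sample.

```r
# Hypothetical stand-in for the 1000-row sample rfd.
set.seed(2014)  # fixing the seed makes the split repeatable
demo = data.frame(id = 1:1000, ArrDelay = rnorm(1000, mean = 60, sd = 30))

# Draw 70% of the row indices without replacement for training.
train.rows = sample(nrow(demo), size = floor(0.7 * nrow(demo)))
demo.train = demo[train.rows, ]
demo.test  = demo[-train.rows, ]   # negative indexing keeps the remaining 30%

# The two parts should not overlap and should cover every row.
print(c(train = nrow(demo.train), test = nrow(demo.test)))
print(length(intersect(demo.train$id, demo.test$id)))
```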

We use all the other variables as predictors and see how they affect our target variable.

rfd.predictors = c("DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime", "ArrTime",

"CRSArrTime", "AirTime", "ActualElapsedTime", "Distance")

rfd.rf = randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)

print(rfd.rf)

##

## Call:

## randomForest(x = rfd.train[, rfd.predictors], y = rfd.train$ArrDelay)

## Type of random forest: classification

## Number of trees: 500

## No. of variables tried at each split: 3

##

## OOB estimate of error rate: 24.14%

## Confusion matrix:

## Low High class.error

## Low 272 80 0.2273

## High 89 259 0.2557

plot(rfd.rf)

[Figure: plot of rfd.rf, error (0.20 to 0.40) against number of trees (0 to 500)]

From the diagram, we see that the error rate becomes stable as the number of trees grows, so we use the default number of trees, which is 500.

rfd.train.pred = predict(object = rfd.rf, newdata = rfd.train, type = "class")

rfd.test.pred = predict(object = rfd.rf, newdata = rfd.test, type = "class")

confusionMatrix(data = rfd.train.pred, reference = rfd.train$ArrDelay)

## Confusion Matrix and Statistics

##

## Reference

## Prediction Low High

## Low 352 0

## High 0 348

##

## Accuracy : 1

## 95% CI : (0.995, 1)



## No Information Rate : 0.503

## P-Value [Acc > NIR] : <2e-16

##

## Kappa : 1

## Mcnemar's Test P-Value : NA

##

## Sensitivity : 1.000

## Specificity : 1.000

## Pos Pred Value : 1.000

## Neg Pred Value : 1.000

## Prevalence : 0.503

## Detection Rate : 0.503

## Detection Prevalence : 0.503

## Balanced Accuracy : 1.000

##

## 'Positive' Class : Low

##

confusionMatrix(data = rfd.test.pred, reference = rfd.test$ArrDelay)

## Confusion Matrix and Statistics

##

## Reference

## Prediction Low High

## Low 133 32

## High 29 106

##

## Accuracy : 0.797

## 95% CI : (0.747, 0.841)

## No Information Rate : 0.54

## P-Value [Acc > NIR] : <2e-16

##

## Kappa : 0.59

## Mcnemar's Test P-Value : 0.798

##

## Sensitivity : 0.821

## Specificity : 0.768

## Pos Pred Value : 0.806

## Neg Pred Value : 0.785

## Prevalence : 0.540

## Detection Rate : 0.443

## Detection Prevalence : 0.550

## Balanced Accuracy : 0.795

##

## 'Positive' Class : Low

##



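Although the sample is chosen at random, the accuracy on the test set stays above 70 percent in our runs (0.797 in the run shown), which is higher than that of a single decision tree. The accuracy of 1 on the training set is expected, because the forest is evaluated on the data it was fitted to, so the test-set figure is the one to rely on.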
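The headline figures in the test-set output can be recomputed by hand from the four counts of the confusion matrix. The sketch below uses only the counts printed above, with Low as the positive class; the object names are our own.

```r
# Test-set counts copied from the confusionMatrix output (rows = prediction).
cm = matrix(c(133, 29, 32, 106), nrow = 2,
            dimnames = list(Prediction = c("Low", "High"),
                            Reference  = c("Low", "High")))

# Accuracy: correctly classified rows over all rows.
accuracy = sum(diag(cm)) / sum(cm)

# Sensitivity: share of true Low flights that were predicted Low.
sensitivity = cm["Low", "Low"] / sum(cm[, "Low"])

# Specificity: share of true High flights that were predicted High.
specificity = cm["High", "High"] / sum(cm[, "High"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

The three values round to 0.797, 0.821 and 0.768, which agrees with the Accuracy, Sensitivity and Specificity lines reported by confusionMatrix.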

9 Classification

Classification techniques are used to predict group membership for data instances. For classification we use the knn and svm algorithms.

9.1 knn

K-Nearest Neighbors (knn) is a supervised machine learning algorithm for classifying objects.

library(class) #for knn

library(RWeka) #for IBk function

## Error: package or namespace load failed for ’RWeka’

9.2 Processing Data

We remove the columns that are not useful.

kd = kd[, -20:-29]

kd = kd[, -1:-2]

kd = kd[, -3:-12]

kd = kd[, -4]

kd = kd[, -5]

We keep ArrDelay as our response variable, so we categorize it into two levels: low and high delay.

kd$ArrDelay = ordered(cut(kd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))

Our machine cannot handle the full dataset, so we use 1000 random records.

kdd = kd[sample(nrow(kd), 1000), ] # sample dataset

The IBk function implements the K-NN technique. We use it to predict the arrival delay from the remaining four variables of the kdd data frame and store the result in classifier.

classifier = IBk(ArrDelay ~ DayOfWeek + DayofMonth + Distance + Origin, data = kdd,

control = Weka_control(K = 4)) # k=4 because 4 other variable



## Error: could not find function "IBk"

summary(classifier) # detailed explanation with confusion matrix

## Error: error in evaluating the argument ’object’ in selecting a

method for function ’summary’: Error: object ’classifier’ not found

With the k-nearest neighbour technique we found that around 70% of the data are correctly classified and about 30% are incorrectly classified. The confusion matrix shows that the high-delay class in particular is not classified well.
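Where RWeka cannot be loaded, the knn function from the class package loaded above offers another route to a k-nearest-neighbour classifier. This is a swapped-in technique, not the IBk call used in this report, and the sketch runs on synthetic data: the data frame x, the class means and the 70/30 split are assumptions for illustration only.

```r
library(class)  # provides knn(); a recommended package shipped with R

# Synthetic two-class data (an assumption, not the flight sample).
set.seed(7)
n = 200
x = data.frame(
  Distance = c(rnorm(n / 2, mean = 500, sd = 150), rnorm(n / 2, mean = 1500, sd = 150)),
  DepTime  = c(rnorm(n / 2, mean = 900, sd = 200), rnorm(n / 2, mean = 1900, sd = 200))
)
y = factor(rep(c("Low", "High"), each = n / 2), levels = c("Low", "High"))

# knn() uses Euclidean distance, so put the predictors on a common scale first.
x.scaled = scale(x)

train.idx = sample(n, size = 0.7 * n)
pred = knn(train = x.scaled[train.idx, ], test = x.scaled[-train.idx, ],
           cl = y[train.idx], k = 4)

# Confusion table and overall accuracy on the held-out rows.
conf = table(Prediction = pred, Reference = y[-train.idx])
print(conf)
accuracy = sum(diag(conf)) / sum(conf)
print(accuracy)
```

Note that knn() from class needs numeric predictors, so categorical columns such as Origin or DayOfWeek would have to be encoded numerically before they could be used this way.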

9.3 SVM

The second classification method we use is the SVM. A Support Vector Machine analyzes data and recognizes patterns, and it is used for classification and regression analysis.

10 Processing Data

We keep ArrDelay as our response variable, so we categorize it into two levels: low and high.

sd$ArrDelay = ordered(cut(sd$ArrDelay, c(14, 40, Inf)), labels = c("Low", "High"))

Our machine cannot handle the full dataset, so we take a random sample of it.

sdd = sd[sample(nrow(sd), 1000), ] # sample dataset

We divide the sample into two parts, a training set (sd.train1) and a test set (sd.test1).

sd1 = nrow(sdd)

nxd.train = sample(1:sd1, 0.7 * sd1)

sd.train1 = sdd[nxd.train, ]

sd.test1 = sdd[-nxd.train, ]

For SVM we are using these two libraries.

library(e1071)

library(caret)

The response variable is ArrDelay, which we predict from two variables, DayOfWeek and Distance.

sd.formula = ArrDelay ~ DayOfWeek + Distance

plot.formula = DayOfWeek ~ Distance #For plot X and Y axis

sd.model = svm(formula = sd.formula, data = sd.train1) # for actual model creation.

summary(sd.model) # Detail description of a model.



##

## Call:

## svm(formula = sd.formula, data = sd.train1)

##

##

## Parameters:

## SVM-Type: C-classification

## SVM-Kernel: radial

## cost: 1

## gamma: 0.5

##

## Number of Support Vectors: 639

##

## ( 322 317 )

##

##

## Number of Classes: 2

##

## Levels:

## Low High

sd.predict = predict(sd.model, sd.test1) # prediction on testing data set

# confusionMatrix(data = sd.predict, reference = sd.test1$ArrDelay)

plot(x = sd.model, data = sd.train1, formula = plot.formula) #default: cost=1, gamma=0.5

[Figure: SVM classification plot for the default model (cost = 1, gamma = 0.5). DayOfWeek (1 to 7) against Distance (500 to 2500), with regions classified Low and High; points are drawn as "o" and "x".]

To get a clearer result, we change the cost and gamma parameters.

sd.model = svm(formula = sd.formula, data = sd.train1, type = "C-classification",

kernel = "radial", cost = 1, gamma = 5)

plot(x = sd.model, data = sd.train1, formula = plot.formula)
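One way to see why gamma changes the picture is to look at the radial kernel itself, which e1071 documents as exp(-gamma * |u - v|^2). The sketch below evaluates it for two hypothetical points at the three gamma values tried in this section; the helper rbf.kernel and the points u and v are our own illustration and not part of e1071.

```r
# Radial kernel as documented for e1071::svm: exp(-gamma * |u - v|^2).
rbf.kernel = function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two hypothetical points in scaled (DayOfWeek, Distance) space.
u = c(0.0, 0.0)
v = c(1.0, 0.5)

# Similarity of the two points under the three gamma values tried in this section.
gammas = c(0.1, 0.5, 5)
similarity = sapply(gammas, function(g) rbf.kernel(u, v, gamma = g))
print(data.frame(gamma = gammas, similarity = similarity))
```

With a large gamma the similarity falls off quickly, so each training point influences only its close neighbourhood and the boundary can become very irregular. With a small gamma distant points still count as similar, which gives a smoother boundary.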

[Figure: SVM classification plot for cost = 1, gamma = 5. DayOfWeek (1 to 7) against Distance (500 to 2500), with regions classified Low and High; points are drawn as "o" and "x".]

sd.model = svm(formula = sd.formula, data = sd.train1, type = "C-classification",

kernel = "radial", cost = 1, gamma = 0.1)

plot(x = sd.model, data = sd.train1, formula = plot.formula)

[Figure: SVM classification plot for cost = 1, gamma = 0.1. DayOfWeek (1 to 7) against Distance (500 to 2500), with regions classified Low and High; points are drawn as "o" and "x".]

From this graph we can say that the 6th and 7th days, Saturday and Sunday, have more delays than the rest of the week. Beyond a certain distance, arrival delays become lower.

11 Conclusion

Overall, we found some useful results in our analysis. Within the logistic regression, we found several variables that were statistically significant, such as Distance, Day Of Week, Origin, Destination, departure time, arrival time and Day Of Month. We converted them from numeric variables to categorical variables. We found some of the reasons behind U.S. flight arrival delays.

To find relationships between different variables, we applied association rules and got good results. We found that flights on Monday are more likely to be on time. We also found that delays of Chicago flights were high in December 2008, so we checked the weather records and found that the weather in Chicago was very bad that month. Many flights were affected for that reason.

The fastest and easiest way to make decisions about our dataset is to apply a decision tree, which organizes the data hierarchically. We used the Rpart and Ctree algorithms to build decision trees. From the diagrams, we conclude that delays are relatively higher at night, between 10:30 PM and 5:00 AM. Delays on weekends (Saturday and Sunday) also tend to be higher than on weekdays, which makes sense.

We also used cluster analysis. After applying the Kmeans, Hclust and Pam clustering methods to our dataset, we found that two clusters is the best choice, and we validated that with the clValid function and other measurements. After separating the dataset into two subsets, we also found relationships that agree with what we found in the association rules.

In classification we used the knn and svm techniques. With the k-nearest neighbor technique we found that approximately 70 percent of the data are correctly classified, which is also what the confusion matrix shows.

We did some pattern recognition with the help of the Support Vector Machine (SVM), where we found that the longer the distance, the longer the delay.

Based on our analysis, we suggest traveling in the daytime on weekdays to improve the chance of arriving at your destination on time.

12 Limitation

Our analysis of this dataset still has a few limitations.

First, we are limited by our computers' processing capability. The original dataset is huge, containing 7,009,728 observations, so we selected a part of it (all U.S. airlines data for December 2008) to reduce the size of the file loaded into R. In addition, R often got stuck or even crashed when we ran the PAM algorithm in the cluster analysis and RandomForest. So we had to use a random sample when applying such functions, for example when computing the distance matrix. However, we have not verified how the random sample affects our results.

Another limitation is that so far we have focused only on delays. There might be other interesting relationships among the other variables, which we may work on in the future.

13 Future Work

While we have already obtained some results, there is still work we can do in the future.

First, due to the limitations of our computers, we are not able to process large-scale data, so we cannot apply some of the functions to the full dataset. In our analysis we only use a random sample, so the results may differ from run to run and may not always be accurate.

What's more, we may find more relationships, because so far our target has only been whether flights are on time or delayed. Something valuable is still waiting for us. For instance, we may find the busiest carrier.

Last but not least, from previous work we found that DBSCAN does not work well unless the dataset is very large, so we did not apply it to our dataset. We would like to see how DBSCAN performs and compare its results with the other clustering methods.
