machine learning in aviation - computational...
TRANSCRIPT
Intro Supervised Clustering Conclusions
MACHINE LEARNING IN AVIATION
PEDRO LARRANAGA
Computational Intelligence GroupArtificial Intelligence Department
Technical University of Madrid, Spain
Toulouse, July 11, 2013
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Outline
1 Introduction
2 Supervised Classification
3 Clustering
4 Conclusions
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Outline
1 Introduction
2 Supervised Classification
3 Clustering
4 Conclusions
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Introduction
Data in aviationData in our world: smartphones, social networks, atmospheric readings, financialdata, medical records, bioinformatics, airplane instrumentation, etc.
Estimation of 2.5 exabytes of data generated on the planet every day
Data in aviation:
Flight trackingWeather conditionsAirport informationAirline informationMarket informationPassenger informationAir safety reportsAircraft data
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Introduction
Data in USA cross-country commercial flights
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Introduction
Big Data. The five V’s: volume, variety, velocity, viability and value
DAVENPORT, TH (2013). At the Big Data Crossroads: Turning Towards a Smarter Travel Experience. Amadeus ITGroup
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Introduction
Aviation industryCharacteristics: Large scale and unstructured mixture of data in a variety of dataformats
RadarWeather (spatial-temporal)Terrain (spatial)InfrastructureText
Benefits:
Help airlines increase salesCustomer loyaltyImprove fuel management and efficiencyManage fleets more effectivelyImprove customer service
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Pattern recognition (PR) a step in knowledge discovery in databases (KDD)
PR a part of KDD
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Outline
1 Introduction
2 Supervised Classification
3 Clustering
4 Conclusions
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Supervised: From labelled data to classification modelsPredictor variables (attributes) and one labelled (class) variable:
X1 . . . Xn C(x (1), c(1)) x (1)
1 . . . x (1)n c(1)
(x (2), c(2)) x (2)1 . . . x (2)
n c(2)
. . . . . . . . .
(x (N), c(N)) x (N)1 . . . x (N)
n c(N)
x (N+1) x (N+1)1 . . . x (N+1)
n ???
Construct a classification model that predicts with highaccuracy the class of a new instance only characterized by thepredictor variables
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Linear decision boundary
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Non linear decision boundary
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Disease diagnosis
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Prediction of the secondary structure of proteins
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Fraudulent use of credit cards
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Spam email
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Handwritten character recognition
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Weather forecasting
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Less is more. Dimensionality reduction with principal component analysis (PCA)
Directions of max variance of the data
3-D data, but distributed mostly on a 2-D surface⇒ Find it
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Less is more. Feature subset selection (FSS)
Requirement: Scoring function to measure the quality of a subset
Filter approach: intrinsic characteristics of the data
Wrapper approach: knowledge about the classifier
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Less is more. Feature subset selection (FSS)
FSS as a combinatorial optimization problem: search strategies (forward,backward, genetic algorithms, ....)
Cardinality of the search space: 2n
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Measuring the performance. Confusion matrix
C True class+ -
CM Predicted class + a b- c d
Figures of meritAccuracy: a+d
a+b+c+d
Error rate: c+ba+b+c+d
Rate of true positives (sensitivity): aa+c
Rate of true negatives (specificity): db+d
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
ROC curve. Area under the ROC curve (AUC)
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Estimation methods. No honest
pM =1N
N∑i=1
δ(c(i) = c(i)M )
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Estimation methods. Train and test
pM =1
N − N1
N−N1∑i=1
δ(c(N1+i) = c(N1+i)M )
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Estimation methods. k–fold cross validation
pM =1k
k∑i=1
pi
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. k -NEAREST NEIGHBORS
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. CLASSIFICATION TREE
Equivalent set of rules:
R1: If (Outlook=Sunny) and (Humidity=High) then PLAYTENNIS=NoR2: If (Outlook=Sunny) and (Humidity=Normal) then PLAYTENNIS=YesR3: If (Outlook=Overcast) then PLAYTENNIS=YesR4: If (Outlook=Rain) and (Wind=Strong) then PLAYTENNIS=NoR5: If (Outlook=Rain) and (Wind=Weak) then PLAYTENNIS=Yes
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. NAIVE BAYES
Predictor variables are conditionally independent given C
P(c|x1, ..., xn) ∝ P(C = c)n∏
i=1
P(Xi = xi |C = c)
⇒ c∗ = arg maxcP(C = c)∏n
i=1 P(Xi = xi |C = c)
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. LOGISTIC REGRESSION
πj = P(C = 1|xj) =1
1+e−(β0+β1xj1+···+βnxjn)
⇒ 1− πj = P(C = 0|xj) =1
1+e(β0+β1xj1+···+βnxjn)
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. NEURAL NETWORK
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. SUPPORT VECTOR MACHINE
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. METACLASSIFIERS
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Classifiers. METACLASSIFIERS. RANDOM FOREST
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Case study on diagnose aviation turbulence
Predominant cause of accidentsand injuries
Costing airlines millions of eurosper year in compensation: aircraftdamage, and delays due topost-event inspections andrepairs
Attempts to avoid turbulentairspace cause: flight delays anden route deviations, increasing airtraffic controller workload, disruptschedules of air crews andpassengers and use extra fuel
WILLIAMS JK (2013). Using random forest to diagnose aviation turbulence. Machine Learning, in press
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Case study on diagnose aviation turbulenceData sets (March 10 to November 4, 2010)
UAL from 95 United Airlines, Boeing 757: 5,623,738 instancesDAL from 80 Delta Air Lines, Boeing 737: 6,595,922 instancesPredictor variables: Numerical weather predictors, Dopller radar,Geostationary satellite images, Lightning detection network, Derived fieldsand featuresClass variable: Eddy dissipation rate (EDR) an aircraft-independentatmospheric turbulence metric
Feature subset selection: wrapper forward/backward selection guided by theAUC
Supervised classification methods: k -nearest neighbor (k = 100); logisticregression; and random forest (200 trees)
Validation: 10-fold cross-validation repeated 32 times with AUC as performancemetric
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Case study on diagnose aviation turbulence
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Supervised classification
Case study on diagnose aviation turbulence
PODy= TPR; PODn= FPR; Random forest (blue) with AUC=0.924; k -NN (green) with AUC=0.915; logisticregression (green) with AUC=0.915; Graphical turbulence guidance (magenta) with AUC=0.816; stormdistance (magenta) with AUC=0.743
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Multi-label classification
X1 X2 X3 X4 X5 C3.2 1.4 4.7 7.5 3.7 12.8 6.3 1.6 4.7 2.7 07.7 6.2 4.1 3.3 7.7 19.2 0.4 2.8 0.5 3.9 05.5 5.3 4.9 0.6 6.6 1
X1 X2 X3 X4 X5 C1 C2 C3 C43.2 1.4 4.7 7.5 3.7 1 0 1 12.8 6.3 1.6 4.7 2.7 0 0 1 07.7 6.2 4.1 3.3 7.7 1 0 1 19.2 0.4 2.8 0.5 3.9 0 1 0 05.5 5.3 4.9 0.6 6.6 1 1 0 1
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Multi-label classification
Case study on the ASRS databaseThe Aviation Safety Reporting System (ASRS) database spans 30 years andcontains over 700,000 aviation safety reports in free text form
Primary concern is system health and safety: detection, diagnosis, prediction,mitigation, and prevention of ongoing and future system problems
ARSR reports are publicly available and are written by pilots, flight controllers,technicians, flight attendants, and others including passengers
Predictor variables, X1 to X25,729: 25,729 terms that occur in at least onedocument
Class variables to be predicted, C1 to C60: 60 problem types that appear duringflights
A subset of 28,596 documents containing 22 classes was used
OZA N, ET AL. (2009). Classification of aeronautics system health and safety documents. IEEE Transactions onSystems, Man, and Cybernetics, Part C, 39(6), 670-680
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Multi-label classification
Case study on the ASRS database
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Multi-label classification
Case study on the ASRS database
Correlations between pairs of class variables
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Multi-label classification
Case study on the ASRS database
AUC performances over the 60 univariate classification problems
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Outline
1 Introduction
2 Supervised Classification
3 Clustering
4 Conclusions
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Clustering
A general dataset
X1 . . . Xn
x (1) x (1)1 . . . x (1)
n. . . . . .
x (N) x (N)1 . . . x (N)
n
Grouping objects in clustersHigh similarity within each clusterHigh dissimilarity between the different clustersMain methods:
Hierarchical clusteringPartitional clusteringProbabilistic clustering
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Clustering
HIERARCHICAL CLUSTERING
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Clustering
PARTITIONAL CLUSTERING: k -means
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Clustering
PROBABILISTIC CLUSTERING: finite mixture models with EM
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Clustering
Case study on detecting anomalous landing
A key question on the aviationsafety domain
For a specific aircraft make andmodel at a specific airport
Variables: sequence of switchesthat a pilot flips during the courseof the landing phase of the flight
Data set: N= 2,200 flights, and n=1,500 possible switches
Cluster algorithm: k -medoids
Anomalous landing: a sequencethat is far away from the clustercentroid
BUDALAKOTI ET AL. (2009). Anomaly detection and diagnosis algorithms for discrete symbol sequences withapplications to airline safety. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 39(1), 101-113
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Outline
1 Introduction
2 Supervised Classification
3 Clustering
4 Conclusions
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
Conclusions
Machine learning in aviationAviation industry generates large scale dataTransform these data sets into knowledgeMachine learning methods:
Supervised classificationClustering
Advances in the safety, security, and efficiency of civilaviation
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
References on Supervised ClassificationBIELZA C, LI G, LARRANAGA P (2011). Multi-dimensional classification withBayesian networks. International Journal of Approximate Reasoning, 52,705-727BISHOP CM (2006). Pattern Recognition and Machine Learning. SpringerBISHOP CM (1995). Neural Networks for Pattern Recognition. Oxford UniversityPressBREIMAN L, FRIEDMAN J, OLSHEN R, STONE C (1984). Classification andRegression Trees. Chapman and HallCOVER TM, HART PE (1967). Nearest neighbor pattern classification. IEEETransactions on Information Theory, 13(1), 21-27HASTIE H, TIBSHIRANI R, FRIEDMAN J (2001). The Elements of StatisticalLearning. Data Mining, Inference, and Prediction. SpringerKUNCHEVA LI (2004). Combining Pattern Classifiers. John Wiley and SonsMURPHY KP (2012). Machine Learning. A Probabilistic Perspective. The MITPressLIU H, MOTODA H (1998) Feature Selection for Knowledge Discovery and DataMining. Kluwer Academic PublishersSAEYS Y, INZA I, LARRANAGA P (2007). A review of feature selection techniquesin bioinformatics. Bioinformatics, 23(19), 2507-2517SHAWE-TAYLOR J, CRISTIANINI N (2000). Support Vector Machines and otherKernel-based Learning Methods. Cambridge University PressZHANG ML, ZHOU ZH (2013). A review on multi-label learning algorithms. IEEETransactions on Knowledge and Data Engineering, in press
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
References on Clustering
FORGY EW (1965). Cluster analysis of multivariate data: efficiency versusinterpretability of classifications. Biometrics, 21, 768-769
FREY BJ, DUECK D (2007). Clustering by passing messages between datapoints. Science, 315 (5814), 972-976
HARTIGAN JA (1975). Clustering Algorithms. John Wiley and Sons
JAIN AK, MURTY MN, FLYNN PJ (1999). Data clustering: A review. ACMComputing Surveys, 31, 3, 264-323
JOHNSON SC (1967). Hierarchical clustering schemes. Psychometrika,2,241-254
KAUFMAN L, ROUSSEEUW P (1990). Finding Groups in Data: An Introduction toCluster Analysis. Wiley and Sons
MACQUEEN JB (1967). Some methods for classification and analysis ofmultivariate observations. Proceedings of 5th Berkeley Symposium onMathematical Statistics and Probability, 281-297
P. Larranaga Machine Learning in Aviation
Intro Supervised Clustering Conclusions
MACHINE LEARNING IN AVIATION
PEDRO LARRANAGA
Computational Intelligence GroupArtificial Intelligence Department
Technical University of Madrid, Spain
Toulouse, July 11, 2013
P. Larranaga Machine Learning in Aviation