Some Take-Home Messages (THM) about ML - Data Science Meetup
Gianluca Bontempi
Interuniversity Institute of Bioinformatics in Brussels (IB)2
Machine Learning Group, Computer Science Department, ULB
mlg.ulb.ac.be, ibsquare.be
May 20, 2016
Introducing myself
1992: Computer science engineer (Politecnico di Milano, Italy)
1994: Researcher in robotics at IRST, Trento, Italy
1995: Researcher at IRIDIA, ULB, Brussels
1996-97: Researcher at IDSIA, Lugano, Switzerland
1998-2000: Marie Curie fellowship at IRIDIA, ULB
2000-2001: Scientist at Philips Research, Eindhoven, The Netherlands
2001-2002: Scientist at IMEC, Microelectronics Institute, Leuven, Belgium
since 2002: professor in Machine Learning, Modeling and Simulation, and Bioinformatics in the ULB Computer Science Dept.
since 2004: head of the ULB Machine Learning Group (MLG)
since 2013: director of the Interuniversity Institute of Bioinformatics in Brussels (IB)2, ibsquare.be
What is machine learning?
Machine learning is that domain of computational intelligence which is concerned with the question of how to construct computer programs that automatically improve with experience. (Mitchell, 1997)
Reductionist attitude: ML is just a buzzword which equates to statistics plus marketing.
Positive attitude: ML paved the way to the treatment of real problems related to data analysis, sometimes overlooked by statisticians (nonlinearity, classification, pattern recognition, missing variables, adaptivity, optimization, massive datasets, data management, causality, representation of knowledge, parallelisation).
Interdisciplinary attitude: ML should have its roots in statistics and complement it by focusing on algorithmic issues, computational efficiency and data engineering.
Prediction is pervasive ...
Predict
whether you will like a book/movie (collaborative filtering)
credit applicants as low, medium, or high risk.
which home telephone lines are used for Internet access.
which customers are likely to stop being customers (churn).
the value of a piece of real estate
which telephone subscribers will order a 4G service
which CARREFOUR clients will be most interested in a discount on Italian products.
the probability that a company is employing undeclared workers (anti-fraud detection).
the survival risk of a patient on the basis of a genetic signature
the probability of a crime in an urban area.
the key of a cryptographic algorithm on the basis of power consumption.
Supervised learning
First assumption: learning is essentially about prediction!
Second assumption: reality is stochastic; dependency and uncertainty are well described by conditional probability.
[Diagram: a prediction model maps the inputs of the training dataset to a prediction; the error is measured between the prediction and the target output.]
measurable features (inputs)
measurable target variables (outputs) and accuracy criteria
data ("in God we trust; all others must bring data")
THM1: formalizing a problem as a prediction problem is often the most important contribution of a data scientist!
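The three ingredients above (measurable inputs, a measurable target with an accuracy criterion, and data) can be sketched in a few lines of code. This is a toy illustration on synthetic data; all numbers and names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Measurable input features and a measurable target (synthetic data).
X = rng.uniform(-1, 1, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=100)  # stochastic reality

# A model: here, a linear map from inputs to predictions.
w, b = np.polyfit(X[:, 0], y, 1)
y_hat = w * X[:, 0] + b

# An accuracy criterion: mean squared error on the training data.
mse = np.mean((y - y_hat) ** 2)
print(f"training MSE: {mse:.4f}")
```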
It is all about ...
1 Probabilistic modeling
- it formalizes uncertainty and dependency (regression function)
- notions of entropy and information
- relevant and irrelevant features (e.g. the Markov blanket notion)
- Bayesian networks, causal reasoning
2 Estimation
- bias/variance notions
- generalization issues: underfitting vs overfitting
- Bayesian, frequentist, decision theory
- validation
- combination/averaging of estimators (bagging, boosting)
3 Optimization
- maximum likelihood, least squares, backpropagation
- dual problems (SVM)
- L1, L2 norms (lasso)
4 Computer science
- implementation, algorithms
- parallelism, scalability
- data management
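As a minimal illustration of the optimization item: under a Gaussian noise assumption, maximizing the likelihood of a linear model reduces exactly to least squares, solvable via the normal equations. A sketch on invented toy data:

```python
import numpy as np

rng = np.random.default_rng(6)

# Design matrix with an intercept column; true coefficients are invented.
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(0, 0.2, 50)

# Normal equations: the least-squares solution, which under Gaussian
# noise is also the maximum-likelihood estimate.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated coefficients:", np.round(beta_ls, 2))
```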
So ... how to teach machine learning?
Focus on ...
Formalism ?
Algorithms ?
Coding ?
Applications ?
Of course, all of this is important, but what is the essence? What is common to the exploding number of algorithms, techniques and fancy applications?
Estimation
[Diagram: the same stochastic phenomenon generates several datasets; fed with each dataset, the learner returns a different model and a different prediction.]
THM2: a predictor is an estimator, i.e. an algorithm (black box) which takes data and returns a prediction.
THM3: reality is stochastic, so data is stochastic and prediction is stochastic.
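THM2 and THM3 can be seen directly in code: a learner maps a dataset to a prediction, and since the datasets are random draws from the same stochastic phenomenon, the predictions are random too. A toy sketch using the sample mean as the learner (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def learner(dataset):
    """A black-box estimator: here, simply the sample mean as prediction."""
    return dataset.mean()

# The same stochastic phenomenon (mean 5, sd 2) generates many datasets...
predictions = [learner(rng.normal(5, 2, size=30)) for _ in range(1000)]

# ...and the learner returns a different prediction each time.
print(f"mean prediction: {np.mean(predictions):.3f}")
print(f"sd of predictions: {np.std(predictions):.3f}")  # near 2/sqrt(30)
```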
Assessing in an uncertain world (Baggio, 1998)
"Non aver paura di sbagliare un calcio di rigore, non è mica da questi particolari che si giudica un giocatore" ("do not be afraid of missing a penalty kick; it is not on such details that a player is judged", De Gregori, 1982).
Assessing a learner
The goal of learning is to find a model which is able to generalize, i.e. able to return good predictions in contexts with the same distribution but independent of the training set.
How to estimate the quality of a model?
It is always possible to find models with such a complicated structure that they achieve zero training error. Are these models good?
Typically NOT, since doing very well on the training set can mean doing badly on new data.
This is the phenomenon of overfitting.
THM4: learning is challenging since data have to be used 1) for creating prediction models and 2) for assessing them.
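A toy illustration of THM4 and of overfitting: a polynomial flexible enough to nearly interpolate the training points does very well on the training set and badly on fresh data. The degrees, sample sizes and noise level are arbitrary choices for this synthetic example.

```python
import numpy as np

rng = np.random.default_rng(2)
x_tr = rng.uniform(-1, 1, 15)
y_tr = np.sin(np.pi * x_tr) + rng.normal(0, 0.3, 15)
x_te = rng.uniform(-1, 1, 200)
y_te = np.sin(np.pi * x_te) + rng.normal(0, 0.3, 200)

results = {}
for degree in (3, 12):  # degree 12 nearly interpolates the 15 points
    coefs = np.polyfit(x_tr, y_tr, degree)
    tr_mse = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    te_mse = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    results[degree] = (tr_mse, te_mse)
    print(f"degree {degree:2d}: train MSE {tr_mse:.4f}, test MSE {te_mse:.2f}")
```

The high-degree fit wins on the training set and loses on the test set, which is why assessment requires data kept apart from fitting.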
Bias and variance of a model
Estimation theory: the mean squared error (a measure of the generalization quality) can be written as

MSE = σ_w² + bias² + variance

where
the noise σ_w² concerns the reality alone,
the bias reflects the relation between reality and the learning algorithm,
the variance concerns the learning algorithm alone.
This is purely theoretical, since these quantities cannot be measured ...
... but it is useful for understanding why and in which circumstances learners work.
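The three terms cannot be measured on real data, but in a simulation, where the data-generating process is known, they can be estimated by Monte Carlo and the decomposition checked numerically. A sketch with an invented sinusoidal "reality" and a linear learner:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(np.pi * x)  # the "reality" (regression function)

sigma = 0.3   # noise sd
x0 = 0.5      # query point

# Many independent training sets -> many predictions at x0.
preds = []
for _ in range(2000):
    x = rng.uniform(-1, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    coefs = np.polyfit(x, y, 1)  # the learner: a linear model
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

noise = sigma ** 2
bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()

# Direct Monte Carlo estimate of the MSE at x0, with fresh noisy targets.
mse = np.mean((f(x0) + rng.normal(0, sigma, preds.size) - preds) ** 2)
print(f"noise {noise:.3f} + bias^2 {bias2:.3f} + variance {variance:.3f}"
      f" = {noise + bias2 + variance:.3f}, direct MSE {mse:.3f}")
```

Here the linear learner is too simple for the sinusoid, so the bias term dominates the variance term.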
The bias/variance dilemma
Noise is all that cannot be learned from data
Bias measures the lack of representational power of the class of hypotheses.
Too simple a model ⇒ large bias ⇒ underfitting.
Variance warns us against an excessive complexity of the approximator.
Too complex a model ⇒ large variance ⇒ overfitting.
A neural network is less biased than a linear model but inevitably more variant.
Averaging (e.g. bagging, boosting, random forests) is a good cure for variance.
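The variance-curing effect of averaging can be seen in its idealized form: the mean of B learners trained on independent datasets has roughly 1/B of the variance of a single learner, with the same bias. Bagging approximates this with bootstrap replicates of one dataset; the sketch below, for clarity, uses truly independent synthetic datasets (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(np.pi * x)

def fit_predict(x0, n=20, degree=7):
    """One high-variance learner: a degree-7 polynomial on 20 noisy points."""
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, 0.3, n)
    return np.polyval(np.polyfit(x, y, degree), x0)

x0 = 0.5
# Variance of a single learner's prediction across training sets...
single = np.array([fit_predict(x0) for _ in range(500)])
# ...versus the variance of an average of 10 such learners.
averaged = np.array([np.mean([fit_predict(x0) for _ in range(10)])
                     for _ in range(500)])

print(f"variance, single learner : {single.var():.4f}")
print(f"variance, average of 10  : {averaged.var():.4f}")
```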
Bias/variance trade-off
[Plot: generalization error vs model complexity; the bias term decreases and the variance term increases with complexity, giving underfitting on the left and overfitting on the right.]
THM5: think in terms of the bias/variance tradeoff. Take your preferred learning algorithm and discover how bias/variance is managed.
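One way to apply THM5 is to sweep a complexity knob and watch the generalization error trace the U-shaped curve: error first falls as bias shrinks, then rises as variance takes over. A synthetic sketch with polynomial degree as the complexity knob (degrees and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return np.sin(np.pi * x)

# One noisy training set; a clean, dense test grid for the true error.
x_tr = rng.uniform(-1, 1, 25)
y_tr = f(x_tr) + rng.normal(0, 0.3, 25)
x_te = np.linspace(-0.95, 0.95, 400)
y_te = f(x_te)

test_err = {}
for degree in (1, 3, 5, 9, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)
    test_err[degree] = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: test MSE {test_err[degree]:.3f}")
```

Degree 1 underfits (bias), the highest degree overfits (variance), and an intermediate degree does best.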
Ockham's Razor
THM6: "Pluralitas non est ponenda sine necessitate", i.e. one should not increase, beyond what is necessary, the number of entities required to explain anything.
This is the medieval rule of parsimony, or principle ofeconomy, known as Ockham’s razor.
In other terms, the principle states that one should not make more assumptions than the minimum needed.
It underlies all scientific modeling and theory building. It admonishes us to choose, from a set of otherwise equivalent models, the simplest one.
Be simple: "shave off" those concepts, variables or constructs that are not really needed to explain the phenomenon.
Does the best exist?
Given a finite number of samples, are there any reasons to prefer one learning algorithm over another?
If we make no assumption about the nature of the learning task, can we expect any learning method to be superior or inferior overall?
Can we even find an algorithm that is overall superior (or inferior) to random guessing?
The No Free Lunch Theorem answers NO to these questions.
No Free Lunch theorem
If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one learning method over another.
If one algorithm seems to outperform another in a particular situation, it is a consequence of its fit to the particular pattern recognition problem, not of the general superiority of the algorithm.
The theorem also justifies skepticism about studies that demonstrate the overall superiority of a particular learning or recognition algorithm.
If a learning method performs well over some set of problems, then it must perform worse than average elsewhere. No method can perform well throughout the full set of functions.
THM7: every learning algorithm makes assumptions (most of the time in an implicit manner) and these assumptions make the difference.
Conclusion
Popper claimed that, if a theory is falsifiable (i.e. it can be contradicted by an observation or the outcome of a physical experiment), then it is scientific. Since prediction is the most falsifiable aspect of science, it is also the most scientific one.
Effective machine learning is an extension of statistics, in noway an alternative.
Simplest (i.e. linear) model first.
Modelling is more an art than an automatic process... hence experienced data analysts are more valuable than expensive tools.
Expert knowledge matters..., data too
Understanding what is predictable is as important as trying topredict it.
All models are wrong, some of them are useful.
All that we did not discuss...
Dimensionality reduction and feature selection
Causal inference
Unsupervised learning
Active learning
Spatio-temporal prediction
Nonstationary problems
Scalable machine learning
Control and robotics
Libraries and platforms (R, python, Weka)
Resources
A biased list ...:-)
Scoop-it on machine learning: www.scoop.it/t/machine-learning-by-gianluca-bontempi
Scoop-it on probabilistic reasoning, causal inference and statistics: www.scoop.it/t/probabilistic-reasoning-and-statistics
MLG mlg.ulb.ac.be
MA course INFO-F-422 Statistical foundations of machine learning
Handbook available at https://www.otexts.org