Learning Theory Put to Work
Isabelle Guyon, [email protected]
What is the process of Data Mining / Machine Learning?
[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; a query to the trained machine returns an answer.]
For which tasks?
• Classification (binary/categorical target)
• Regression and time series prediction (continuous targets)
• Clustering (targets unknown)
• Rule discovery
For which applications?
[Chart: application areas plotted by number of inputs (10 to 10^5) vs. number of training examples (10 to 10^6): OCR/HWR, market analysis, text categorization, system diagnosis, bioinformatics, quality control, machine vision, customer knowledge.]
Banking / Telecom / Retail
• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics
• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery
• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet
• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves
• Internet:
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
From Statistics to Machine Learning… and back!
• Old textbook statistics were descriptive:
– Mean, variance
– Confidence intervals
– Statistical tests
– Fit data, discover distributions (past data)
• Machine learning (1960s) is predictive:
– Training / validation / test sets
– Build robust predictive models (future data)
• Learning theory (1990s):
– Rigorous statistical framework for ML
– Proper monitoring of fit vs. robustness
Some Learning Machines
• Linear models
• Polynomial models
• Kernel methods
• Neural networks
• Decision trees
Conventions
• X = {xij}: data matrix with m samples (customers / patients) in rows and n attributes / features in columns; xi denotes one sample.
• y = {yj}: vector of the m target values.
• w: vector of model parameters (weights).
Linear Models
f(x) = Σj=1:n wj xj + b
Linear discriminant (for classification):
• F(x) = +1 if f(x) > 0
• F(x) = -1 if f(x) ≤ 0
LINEAR = WEIGHTED SUM
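To make the weighted sum concrete, here is a minimal Python sketch of the linear model and its discriminant; the data, weights, and bias are made-up values for illustration.

```python
# Minimal sketch: f(x) = sum_{j=1..n} w_j x_j + b and F(x) = sign(f(x)).
import numpy as np

def f(X, w, b):
    """Weighted sum f(x) for each row x of X."""
    return X @ w + b

def F(X, w, b):
    """Linear discriminant: +1 if f(x) > 0, -1 otherwise."""
    return np.where(f(X, w, b) > 0, 1, -1)

# Toy data: 3 samples, 2 features (illustrative values)
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
w = np.array([0.7, -0.2])   # one weight per feature
b = 0.1                     # bias/offset
print(F(X, w, b))           # class labels in {-1, +1}
```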
Non-linear models
Linear models (artificial neurons):
• f(x) = Σj=1:n wj xj + b
Models non-linear in their inputs, but linear in their parameters:
• f(x) = Σj=1:N wj φj(x) + b (Perceptron)
• f(x) = Σi=1:m αi k(xi, x) + b (Kernel method)
Other non-linear models:
• Neural networks / multi-layer perceptrons
• Decision trees
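As an illustration of a model that is non-linear in its inputs but linear in its parameters, here is a small Python sketch of the kernel expansion f(x) = Σi αi k(xi, x) + b. The Gaussian (RBF) kernel choice and the coefficient values are assumptions for the example, not prescribed by the slides.

```python
# Sketch of a kernel model: non-linear in x, linear in the coefficients alpha.
import numpy as np

def rbf_kernel(xi, x, gamma=1.0):
    """k(xi, x) = exp(-gamma * ||xi - x||^2) -- an assumed kernel choice."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def f(x, X_train, alpha, b, gamma=1.0):
    """Kernel expansion over the m training points x_i."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train)) + b

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.5, -0.3, 0.8])   # one coefficient per training sample
print(f(np.array([1.0, 0.5]), X_train, alpha, b=0.0))
```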
Linear Decision Boundary
[Figure: separating hyperplane f(x) = 0 in the (x1, x2) plane, with the half-spaces f(x) > 0 and f(x) < 0 on either side.]
Non-linear Decision Boundary
[Figure: curved decision boundary in the (x1, x2) plane.]
Fit / Robustness Tradeoff
[Figure: decision boundaries of varying complexity in the (x1, x2) plane, illustrating the tradeoff between fitting the training data and robustness.]
Confusion matrix (cost matrix):

                        Predictions F(x)
                    Class -1     Class +1     Total
Truth y   Class -1     tn           fp        neg = tn + fp
          Class +1     fn           tp        pos = fn + tp
Total             rej = tn + fn  sel = fp + tp   m = tn + fp + fn + tp

• False alarm rate = fp/neg = type I error rate = 1 - specificity
• Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
• Fraction selected = sel/m
• Precision = tp/sel
Performance Assessment
Compare F(x) = sign(f(x)) to the target y, and report:
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Fraction selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
• F measure = 2 · precision · recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
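A small Python sketch computing these quantities directly from the confusion-matrix counts; the toy counts are made up:

```python
# Reported quantities from the counts tn, fp, fn, tp (names as in the table).
def metrics(tn, fp, fn, tp):
    m = tn + fp + fn + tp
    pos, neg, sel = fn + tp, tn + fp, fp + tp
    hit_rate = tp / pos                 # sensitivity / recall
    precision = tp / sel
    return {
        "error rate": (fn + fp) / m,
        "BER": (fn / pos + fp / neg) / 2,
        "F measure": 2 * precision * hit_rate / (precision + hit_rate),
        "hit rate": hit_rate,
        "false alarm rate": fp / neg,   # 1 - specificity
        "fraction selected": sel / m,
    }

print(metrics(tn=50, fp=10, fn=5, tp=35))   # illustrative counts
```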
ROC Curve
[Figure: ROC curve; x-axis: false alarm rate = 1 - specificity (0 to 100%); y-axis: hit rate = sensitivity (0 to 100%); ideal ROC curve (AUC = 1), actual ROC curve, and random ROC (AUC = 0.5).]
Patients are diagnosed by putting a threshold on f(x). For a given threshold you get a point on the ROC curve.
0 ≤ AUC ≤ 1
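A minimal Python sketch of tracing the ROC curve by sweeping the threshold over the scores f(x), with the AUC estimated by the trapezoidal rule; the scores and labels are toy data:

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.55, 0.4, 0.3, 0.1])   # f(x) for 6 samples
labels = np.array([  1,   1,   -1,   1,  -1,  -1])   # true classes y
pos, neg = np.sum(labels == 1), np.sum(labels == -1)

points = [(0.0, 0.0)]                  # threshold above every score
for theta in np.sort(scores)[::-1]:    # sweep the threshold downwards
    pred = np.where(scores >= theta, 1, -1)
    hit = np.sum((pred == 1) & (labels == 1)) / pos    # sensitivity
    fa  = np.sum((pred == 1) & (labels == -1)) / neg   # 1 - specificity
    points.append((fa, hit))

fa, hit = zip(*points)
auc = np.trapz(hit, fa)                # area under the ROC curve
print(auc)                             # 8/9 for this toy data
```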
Lift Curve
[Figure: lift curve; x-axis: fraction of customers selected (0 to 100%); y-axis: hit rate = fraction of good customers selected (0 to 100%); random lift, actual lift, and ideal lift.]
Customers are ranked according to f(x); the top-ranking customers are selected.
Gini = 2 AUC - 1
0 ≤ Gini ≤ 1
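A companion sketch of the lift curve on made-up data: customers are ranked by f(x), the top k are selected, and each selection size gives one point of the curve.

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.55, 0.4, 0.3, 0.1])  # f(x) per customer
good   = np.array([  1,   1,    0,   1,   0,   0])  # 1 = good customer
order = np.argsort(scores)[::-1]                    # best-ranked first

m, n_good = len(scores), good.sum()
# One point per selection size k: (fraction selected, fraction of good selected)
lift = [(k / m, good[order][:k].sum() / n_good) for k in range(m + 1)]
print(lift)   # runs from (0, 0) to (1, 1)
```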
What is a Risk Functional?
A function of the parameters of the learning machine that assesses how much the machine is expected to fail on a given task.
Examples:
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
– 1 - AUC (Gini index = 2 AUC - 1)
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi) - yi)²
How to train?
• Define a risk functional R[f(x, w)]
• Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)
[Figure: risk R[f(x, w)] plotted over the parameter space (w), with its minimum at w*.]
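As a concrete instance of "define a risk functional, then optimize it w.r.t. w", here is a Python sketch of gradient descent on the mean-square-error risk of a linear model; the data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # m=100 samples, n=3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3       # toy linear target

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(200):
    residual = X @ w + b - y                   # f(x_i) - y_i
    grad_w = 2 * X.T @ residual / len(y)       # dR/dw for R = MSE
    grad_b = 2 * residual.mean()               # dR/db
    w, b = w - lr * grad_w, b - lr * grad_b    # gradient descent step

print(w, b)   # approaches w* = [1, -2, 0.5], b* = 0.3
```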
Theoretical Foundations
• Structural Risk Minimization
• Regularization
• Weight decay
• Feature selection
• Data compression
Training powerful models without overfitting
Ockham’s Razor
• Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate” (plurality should not be posited without necessity).
• Of two theories providing similarly good predictions, prefer the simplest one.
• Shave off unnecessary parameters of your models.
Risk Minimization
• Examples are given:
(x1, y1), (x2, y2), … (xm, ym)
• Learning problem: find the best function f(x; w) minimizing a risk functional
R[f] = ∫ L(f(x; w), y) dP(x, y)
where L is the loss function and P(x, y) is the unknown distribution of the data.
Approximations of R[f]
• Empirical risk: Rtrain[f] = (1/m) Σi=1:m L(f(xi; w), yi)
– 0/1 loss 1(F(xi) ≠ yi): Rtrain[f] = error rate
– square loss (f(xi) - yi)²: Rtrain[f] = mean square error
• Guaranteed risk:
With high probability (1-δ), R[f] ≤ Rgua[f]
Rgua[f] = Rtrain[f] + ε(C)
Structural Risk Minimization
Vapnik, 1974
[Figure: nested subsets S1 ⊂ S2 ⊂ S3, increasing complexity.]
Nested subsets of models of increasing complexity/capacity:
S1 ⊂ S2 ⊂ … ⊂ SN
[Figure: training error Tr and guaranteed risk Ga = Tr + ε plotted against complexity/capacity C, where ε is a function of the model complexity C.]
SRM Example
• Rank with ||w||² = Σi wi²:
Sk = { w | ||w||² < ωk² }, ω1 < ω2 < … < ωk
• Minimization under constraint:
min Rtrain[f] s.t. ||w||² < ωk²
• Lagrangian:
Rreg[f, γ] = Rtrain[f] + γ ||w||²
[Figure: risk R plotted against capacity over the nested subsets S1 ⊂ S2 ⊂ … ⊂ SN.]
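For the square loss, minimizing the Lagrangian Rreg[f, γ] = Rtrain[f] + γ||w||² is ridge regression, which has a closed-form minimizer. A minimal Python sketch (the data and γ values are made up; the bias term is omitted for brevity):

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize (1/m)*sum_i (w.x_i - y_i)^2 + gamma*||w||^2 in closed form."""
    m, n = X.shape
    # Setting the gradient to zero gives (X'X/m + gamma*I) w = X'y/m.
    return np.linalg.solve(X.T @ X / m + gamma * np.eye(n), X.T @ y / m)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

for gamma in [0.0, 0.1, 10.0]:   # larger gamma => smaller ||w|| (shrinkage)
    w = ridge_fit(X, y, gamma)
    print(gamma, np.round(w, 2), round(float(w @ w), 3))
```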
Multiple Structures
• Shrinkage (weight decay, ridge regression, SVM):
Sk = { w | ||w||² < ωk }, ω1 < ω2 < … < ωk
γ1 > γ2 > γ3 > … > γk (γ is the ridge)
• Feature selection:
Sk = { w | ||w||0 < σk }, σ1 < σ2 < … < σk (σ is the number of features)
• Data compression:
σ1 < σ2 < … < σk (σ may be the number of clusters)
Hyper-parameter Selection
• Learning = adjusting:
– parameters (the vector w)
– hyper-parameters (γ)
• Cross-validation with K folds (a sketch in code follows the figure below):
For various values of γ:
– Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10.
– Test on the remaining 1/K examples, e.g. 1/10.
– Rotate the folds and average the test results (CV error).
– Select γ to minimize the CV error.
– Re-compute w on all training examples using the optimal γ.
[Figure: data matrix (X, y) split into training data, on which the K folds are made, and test data held out for a prospective study / “real” validation.]
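A minimal Python sketch of this K-fold recipe, using ridge regression (re-defined from the previous sketch) as the learning machine; the data and the grid of γ values are illustrative assumptions:

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Closed-form ridge solution (same as the SRM example sketch above)."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X / m + gamma * np.eye(n), X.T @ y / m)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

K = 10
folds = np.array_split(rng.permutation(len(y)), K)   # make K folds

def cv_error(gamma):
    errs = []
    for k in range(K):
        test = folds[k]                                    # 1/K for testing
        train = np.concatenate(folds[:k] + folds[k + 1:])  # (K-1)/K for fitting
        w = ridge_fit(X[train], y[train], gamma)
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))
    return np.mean(errs)                                   # average test error

gammas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best = min(gammas, key=cv_error)        # select gamma minimizing CV error
w_star = ridge_fit(X, y, best)          # re-compute w on all training data
```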
Summary
• SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity.
• Multiple structures may be used to control the model capacity, including: feature selection, data compression, ridge regression.
KXEN (simplified) architecture
[Diagram: inputs x1, x2, x3, …, xn pass through data preparation and data encoding (kx); a learning algorithm fits a class of models, with parameters w, under a loss criterion; an output encoding (ky) maps the model outputs y1, y2, …, yp back to the system.]
KXEN: SRM put to work
[Figure: lift curves; x-axis: fraction of customers selected (0 to 100%); y-axis: fraction of good customers selected (0 to 100%); training lift, CV lift, and test lift shown between the random lift and the ideal lift.]
Customers are ranked according to f(x); the top-ranking customers are selected.
Want to Learn More?
• Statistical Learning Theory, V. Vapnik. Theoretical book; the reference on generalization, VC dimension, Structural Risk Minimization, and SVMs. ISBN 0471030031.
• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook; limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook; includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book