Learning Theory Put to Work
Isabelle Guyon, [email protected]
What is the process of Data Mining / Machine Learning?
[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; a query to the trained machine returns an answer.]
For which tasks?
• Classification (binary/categorical target)
• Regression and time series prediction (continuous targets)
• Clustering (targets unknown)
• Rule discovery
For which applications?
[Chart: application areas plotted by number of inputs (10 to 10^5) vs. number of training examples (10 to 10^6): OCR/HWR, market analysis, text categorization, system diagnosis, bioinformatics, quality control, machine vision, customer knowledge.]
Banking / Telecom / Retail
• Identify:
– Prospective customers
– Dissatisfied customers
– Good customers
– Bad payers
• Obtain:
– More effective advertising
– Less credit risk
– Less fraud
– Decreased churn rate
Biomedical / Biometrics
• Medicine:
– Screening
– Diagnosis and prognosis
– Drug discovery
• Security:
– Face recognition
– Signature / fingerprint / iris verification
– DNA fingerprinting
Computer / Internet
• Computer interfaces:
– Troubleshooting wizards
– Handwriting and speech
– Brain waves
• Internet:
– Hit ranking
– Spam filtering
– Text categorization
– Text translation
– Recommendation
From Statistics to Machine Learning… and back!
• Old textbook statistics were descriptive:
– Mean, variance
– Confidence intervals
– Statistical tests
– Fit data, discover distributions (past data)
• Machine learning (1960s) is predictive:
– Training / validation / test sets
– Build robust predictive models (future data)
• Learning theory (1990s):
– Rigorous statistical framework for ML
– Proper monitoring of fit vs. robustness
Some Learning Machines
• Linear models
• Polynomial models
• Kernel methods
• Neural networks
• Decision trees
Conventions
• X = {xij}: data matrix with m samples (customers / patients) in rows and n attributes / features in columns; xi denotes one sample.
• y = {yj}: vector of the m target values.
• w: vector of model parameters (weights).
Linear Models
f(x) = Σj=1:n wj xj + b
Linear discriminant (for classification):
• F(x) = +1 if f(x) > 0
• F(x) = -1 if f(x) ≤ 0
LINEAR = WEIGHTED SUM
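To make the weighted sum concrete, here is a minimal Python sketch of the linear model and its discriminant; the data, weights, and bias are made-up values for illustration.

```python
# Minimal sketch: f(x) = sum_{j=1..n} w_j x_j + b and F(x) = sign(f(x)).
import numpy as np

def f(X, w, b):
    """Weighted sum f(x) for each row x of X."""
    return X @ w + b

def F(X, w, b):
    """Linear discriminant: +1 if f(x) > 0, -1 otherwise."""
    return np.where(f(X, w, b) > 0, 1, -1)

# Toy data: 3 samples, 2 features (illustrative values)
X = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
w = np.array([0.7, -0.2])   # one weight per feature
b = 0.1                     # bias/offset
print(F(X, w, b))           # class labels in {-1, +1}
```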
Non-linear models
Linear models (artificial neurons):
• f(x) = Σj=1:n wj xj + b
Models non-linear in their inputs, but linear in their parameters:
• f(x) = Σj=1:N wj φj(x) + b (Perceptron)
• f(x) = Σi=1:m αi k(xi, x) + b (Kernel method)
Other non-linear models:
• Neural networks / multi-layer perceptrons
• Decision trees
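As an illustration of a model that is non-linear in its inputs but linear in its parameters, here is a small Python sketch of the kernel expansion f(x) = Σi αi k(xi, x) + b. The Gaussian (RBF) kernel choice and the coefficient values are assumptions for the example, not prescribed by the slides.

```python
# Sketch of a kernel model: non-linear in x, linear in the coefficients alpha.
import numpy as np

def rbf_kernel(xi, x, gamma=1.0):
    """k(xi, x) = exp(-gamma * ||xi - x||^2) -- an assumed kernel choice."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def f(x, X_train, alpha, b, gamma=1.0):
    """Kernel expansion over the m training points x_i."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train)) + b

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.5, -0.3, 0.8])   # one coefficient per training sample
print(f(np.array([1.0, 0.5]), X_train, alpha, b=0.0))
```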
Linear Decision Boundary
[Figure: separating hyperplane f(x) = 0 in the (x1, x2) plane, with the half-spaces f(x) > 0 and f(x) < 0 on either side.]
Non-linear Decision Boundary
[Figure: curved decision boundary in the (x1, x2) plane.]
Fit / Robustness Tradeoff
[Figure: decision boundaries of varying complexity in the (x1, x2) plane, illustrating the tradeoff between fitting the training data and robustness.]
Confusion matrix (cost matrix):

                        Predictions F(x)
                    Class -1     Class +1     Total
Truth y   Class -1     tn           fp        neg = tn + fp
          Class +1     fn           tp        pos = fn + tp
Total             rej = tn + fn  sel = fp + tp   m = tn + fp + fn + tp

• False alarm rate = fp/neg = type I error rate = 1 - specificity
• Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
• Fraction selected = sel/m
• Precision = tp/sel
Performance Assessment
Compare F(x) = sign(f(x)) to the target y, and report:
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Fraction selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
• F measure = 2 · precision · recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
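A small Python sketch computing these quantities directly from the confusion-matrix counts; the toy counts are made up:

```python
# Reported quantities from the counts tn, fp, fn, tp (names as in the table).
def metrics(tn, fp, fn, tp):
    m = tn + fp + fn + tp
    pos, neg, sel = fn + tp, tn + fp, fp + tp
    hit_rate = tp / pos                 # sensitivity / recall
    precision = tp / sel
    return {
        "error rate": (fn + fp) / m,
        "BER": (fn / pos + fp / neg) / 2,
        "F measure": 2 * precision * hit_rate / (precision + hit_rate),
        "hit rate": hit_rate,
        "false alarm rate": fp / neg,   # 1 - specificity
        "fraction selected": sel / m,
    }

print(metrics(tn=50, fp=10, fn=5, tp=35))   # illustrative counts
```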
ROC Curve
[Figure: ROC curve; x-axis: false alarm rate = 1 - specificity (0 to 100%); y-axis: hit rate = sensitivity (0 to 100%); ideal ROC curve (AUC = 1), actual ROC curve, and random ROC (AUC = 0.5).]
Patients are diagnosed by putting a threshold on f(x). For a given threshold you get a point on the ROC curve.
0 ≤ AUC ≤ 1
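A minimal Python sketch of tracing the ROC curve by sweeping the threshold over the scores f(x), with the AUC estimated by the trapezoidal rule; the scores and labels are toy data:

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.55, 0.4, 0.3, 0.1])   # f(x) for 6 samples
labels = np.array([  1,   1,   -1,   1,  -1,  -1])   # true classes y
pos, neg = np.sum(labels == 1), np.sum(labels == -1)

points = [(0.0, 0.0)]                  # threshold above every score
for theta in np.sort(scores)[::-1]:    # sweep the threshold downwards
    pred = np.where(scores >= theta, 1, -1)
    hit = np.sum((pred == 1) & (labels == 1)) / pos    # sensitivity
    fa  = np.sum((pred == 1) & (labels == -1)) / neg   # 1 - specificity
    points.append((fa, hit))

fa, hit = zip(*points)
auc = np.trapz(hit, fa)                # area under the ROC curve
print(auc)                             # 8/9 for this toy data
```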
Lift Curve
[Figure: lift curve; x-axis: fraction of customers selected (0 to 100%); y-axis: hit rate = fraction of good customers selected (0 to 100%); random lift, actual lift, and ideal lift.]
Customers are ranked according to f(x); the top-ranking customers are selected.
Gini = 2 AUC - 1
0 ≤ Gini ≤ 1
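A companion sketch of the lift curve on made-up data: customers are ranked by f(x), the top k are selected, and each selection size gives one point of the curve.

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.55, 0.4, 0.3, 0.1])  # f(x) per customer
good   = np.array([  1,   1,    0,   1,   0,   0])  # 1 = good customer
order = np.argsort(scores)[::-1]                    # best-ranked first

m, n_good = len(scores), good.sum()
# One point per selection size k: (fraction selected, fraction of good selected)
lift = [(k / m, good[order][:k].sum() / n_good) for k in range(m + 1)]
print(lift)   # runs from (0, 0) to (1, 1)
```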
What is a Risk Functional?
A function of the parameters of the learning machine that assesses how much the machine is expected to fail on a given task.
Examples:
• Classification:
– Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
– 1 - AUC (Gini index = 2 AUC - 1)
• Regression:
– Mean square error: (1/m) Σi=1:m (f(xi) - yi)²
How to train?
• Define a risk functional R[f(x, w)]
• Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)
[Figure: risk R[f(x, w)] plotted over the parameter space (w), with its minimum at w*.]
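As a concrete instance of "define a risk functional, then optimize it w.r.t. w", here is a Python sketch of gradient descent on the mean-square-error risk of a linear model; the data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # m=100 samples, n=3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3       # toy linear target

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(200):
    residual = X @ w + b - y                   # f(x_i) - y_i
    grad_w = 2 * X.T @ residual / len(y)       # dR/dw for R = MSE
    grad_b = 2 * residual.mean()               # dR/db
    w, b = w - lr * grad_w, b - lr * grad_b    # gradient descent step

print(w, b)   # approaches w* = [1, -2, 0.5], b* = 0.3
```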
Theoretical Foundations
• Structural Risk Minimization
• Regularization
• Weight decay
• Feature selection
• Data compression
Training powerful models without overfitting
Ockham’s Razor
• Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate” (plurality should not be posited without necessity).
• Of two theories providing similarly good predictions, prefer the simplest one.
• Shave off unnecessary parameters of your models.
Risk Minimization
• Examples are given:
(x1, y1), (x2, y2), … (xm, ym)
• Learning problem: find the best function f(x; w) minimizing a risk functional
R[f] = ∫ L(f(x; w), y) dP(x, y)
where L is the loss function and P(x, y) is the unknown distribution of the data.
Approximations of R[f]
• Empirical risk: Rtrain[f] = (1/m) Σi=1:m L(f(xi; w), yi)
– 0/1 loss 1(F(xi) ≠ yi): Rtrain[f] = error rate
– square loss (f(xi) - yi)²: Rtrain[f] = mean square error
• Guaranteed risk:
With high probability (1-δ), R[f] ≤ Rgua[f]
Rgua[f] = Rtrain[f] + ε(C)
Structural Risk Minimization
Vapnik, 1974
[Figure: nested subsets S1 ⊂ S2 ⊂ S3, increasing complexity.]
Nested subsets of models of increasing complexity/capacity:
S1 ⊂ S2 ⊂ … ⊂ SN
[Figure: training error Tr and guaranteed risk Ga = Tr + ε plotted against complexity/capacity C, where ε is a function of the model complexity C.]
SRM Example
• Rank with ||w||² = Σi wi²:
Sk = { w | ||w||² < ωk² }, ω1 < ω2 < … < ωk
• Minimization under constraint:
min Rtrain[f] s.t. ||w||² < ωk²
• Lagrangian:
Rreg[f, γ] = Rtrain[f] + γ ||w||²
[Figure: risk R plotted against capacity over the nested subsets S1 ⊂ S2 ⊂ … ⊂ SN.]
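For the square loss, minimizing the Lagrangian Rreg[f, γ] = Rtrain[f] + γ||w||² is ridge regression, which has a closed-form minimizer. A minimal Python sketch (the data and γ values are made up; the bias term is omitted for brevity):

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize (1/m)*sum_i (w.x_i - y_i)^2 + gamma*||w||^2 in closed form."""
    m, n = X.shape
    # Setting the gradient to zero gives (X'X/m + gamma*I) w = X'y/m.
    return np.linalg.solve(X.T @ X / m + gamma * np.eye(n), X.T @ y / m)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

for gamma in [0.0, 0.1, 10.0]:   # larger gamma => smaller ||w|| (shrinkage)
    w = ridge_fit(X, y, gamma)
    print(gamma, np.round(w, 2), round(float(w @ w), 3))
```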
Multiple Structures
• Shrinkage (weight decay, ridge regression, SVM):
Sk = { w | ||w||² < ωk }, ω1 < ω2 < … < ωk
γ1 > γ2 > γ3 > … > γk (γ is the ridge)
• Feature selection:
Sk = { w | ||w||0 < σk }, σ1 < σ2 < … < σk (σ is the number of features)
• Data compression:
σ1 < σ2 < … < σk (σ may be the number of clusters)
Hyper-parameter Selection
• Learning = adjusting:
– parameters (the vector w)
– hyper-parameters (γ)
• Cross-validation with K folds (a sketch in code follows the figure below):
For various values of γ:
– Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10.
– Test on the remaining 1/K examples, e.g. 1/10.
– Rotate the folds and average the test results (CV error).
– Select γ to minimize the CV error.
– Re-compute w on all training examples using the optimal γ.
[Figure: data matrix (X, y) split into training data, on which the K folds are made, and test data held out for a prospective study / “real” validation.]
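A minimal Python sketch of this K-fold recipe, using ridge regression (re-defined from the previous sketch) as the learning machine; the data and the grid of γ values are illustrative assumptions:

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Closed-form ridge solution (same as the SRM example sketch above)."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X / m + gamma * np.eye(n), X.T @ y / m)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

K = 10
folds = np.array_split(rng.permutation(len(y)), K)   # make K folds

def cv_error(gamma):
    errs = []
    for k in range(K):
        test = folds[k]                                    # 1/K for testing
        train = np.concatenate(folds[:k] + folds[k + 1:])  # (K-1)/K for fitting
        w = ridge_fit(X[train], y[train], gamma)
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))
    return np.mean(errs)                                   # average test error

gammas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
best = min(gammas, key=cv_error)        # select gamma minimizing CV error
w_star = ridge_fit(X, y, best)          # re-compute w on all training data
```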
Summary
• SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity.
• Multiple structures may be used to control the model capacity, including: feature selection, data compression, ridge regression.
KXEN (simplified) architecture
[Diagram: inputs x1, x2, x3, …, xn pass through data preparation and data encoding (kx); a learning algorithm fits a class of models, with parameters w, under a loss criterion; an output encoding (ky) maps the model outputs y1, y2, …, yp back to the system.]
KXEN: SRM put to work
[Figure: lift curves; x-axis: fraction of customers selected (0 to 100%); y-axis: fraction of good customers selected (0 to 100%); training lift, CV lift, and test lift shown between the random lift and the ideal lift.]
Customers are ranked according to f(x); the top-ranking customers are selected.
Want to Learn More?
• Statistical Learning Theory, V. Vapnik. Theoretical book; the reference on generalization, VC dimension, Structural Risk Minimization, and SVMs. ISBN 0471030031.
• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook; limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook; includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book