
Supervised Automatic Learning Models: A New Perspective

Eugenio F. Sánchez Úbeda

Modelling and Simulation in Science, 6th Workshop of DAA, 15-22 April 2007, Erice, Italy


Contents

• Summary of main concepts
• Nature of learning problem
• Learning difficulties
• Multidimensional approaches
• Summary and future research


Motivation

• Huge amounts of data are available in many disciplines of Science (& Industry)

• A large number of “different” learning approaches have been proposed

• Each domain uses its own terminology

Data


Objective

• Present a new outlook on the main existing learning strategies by:
– Providing a rich overview of the main principles and methods underlying most supervised models
– Using a taxonomy that highlights the similarity of some models whose original motivation comes from different fields

[Figure: Data → Universe of automatic learning models → Knowledge]


Data Mining process: typical procedure

• Problem definition (variables)
• Attribute selection
• Model generation
• Interpretation and validation of results
• Model application

[Figure: Real system → Collection of data → DB of examples → Automatic learning → Knowledge about the system]


Automatic learning: main goals

[Figure: Data collected from the system are used to build the MODEL]

• Running the model: estimate
• Looking at the model: understand


Supervised learning: Idea

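[Figure: Real system with observed inputs $x_1, \dots, x_p$, non-observed inputs $z_1, \dots, z_\Omega$ and outputs $y_1, \dots, y_q$]

$$y_i = g_i(x_1, \dots, x_p, z_1, \dots, z_\Omega)$$

[Figure: Model with observed inputs $x_1, \dots, x_p$ and outputs $y_1, \dots, y_q$]

$$y_i = \phi_i(x_1, \dots, x_p) + \varepsilon_i$$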


Data base of examples

• Via collection:

[Figure: Real system with observed inputs $x_1, \dots, x_p$, non-observed inputs $z_1, \dots, z_\Omega$ and outputs $y_1, \dots, y_q$, feeding the DB of examples $(\mathbf{x}, \mathbf{y})$]

$$y_i = g_i(x_1, \dots, x_p, z_1, \dots, z_\Omega)$$

• Via simulation:

[Figure: Generate possible situations (i.e. input vectors $\mathbf{x}$) → Simulate behaviour of the real system (i.e. obtain output vectors $\mathbf{y}$) → DB of examples $(\mathbf{x}, \mathbf{y})$]


Sets of examples

$$\{(x_{1e}, \dots, x_{pe}), (y_{1e}, \dots, y_{qe})\}, \quad e = 1, \dots, N$$

• Growing set and pruning set: build the model (learning process)
• Test or validation set: evaluate the future effectiveness of the model
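The split into growing, pruning and test sets can be sketched as follows. This is only an illustration: the 60/20/20 proportions, the function name `split_examples` and the toy examples are assumptions, not values taken from the slides.

```python
import random

def split_examples(examples, fractions=(0.6, 0.2), seed=0):
    """Shuffle the examples and split them into growing, pruning and test sets.

    fractions gives the (assumed) shares of the growing and pruning sets;
    whatever remains becomes the test (validation) set.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_grow = round(fractions[0] * n)
    n_prune = round(fractions[1] * n)
    growing = shuffled[:n_grow]
    pruning = shuffled[n_grow:n_grow + n_prune]
    test = shuffled[n_grow + n_prune:]
    return growing, pruning, test

# Toy database of N = 100 examples (x_e, y_e) with p = 1 input and q = 1 output.
examples = [((float(e),), (2.0 * e,)) for e in range(100)]
growing, pruning, test = split_examples(examples)
```

The growing and pruning sets are used while building the model; the test set is touched only once, to estimate how the finished model will behave on new data.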


Learning problem definition

• Simple to state:

– “Find a model of a desired dependence using a limited number of observations”

• … but difficult to solve in general

– Two main difficulties:
  • Finite size of the set of examples
  • Random noise


Learning difficulties: Lack of examples

(Considering the case without noise)

[Figure (a), "Ideally": infinite number of points; infinite examples covering the whole input space trace $\phi(x)$ completely]

[Figure (b), "Ideal reality": finite number of points; the model must interpolate and extrapolate $\phi(x)$ where there are no examples]


Learning difficulties: Noise

[Figure (a): infinite number of points $y = \phi(x) + \varepsilon$; with infinite examples covering the whole input space, mean $= \phi(x)$: no problem]

[Figure (b), "True reality": finite number of points $y = \phi(x) + \varepsilon$; $\phi(x)$ is unknown]


Learning problem: Statistical viewpoint (I)

$$y_i = g_i(x_1, \dots, x_p, z_1, \dots, z_\Omega) = g_i(\mathbf{x}, \mathbf{z}), \quad i = 1, \dots, q$$

$$p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z})\, d\mathbf{z}$$

$$y = \phi(x) + \varepsilon$$

[Figure, "Ideally": (a) surface $y = g(x, z)$; (b) joint probability density function $p(x, z)$]

[Figure, "Reality": (c) $y = f(x)$ + noise; (d) marginal probability density function $p(x)$]


Learning problem: Statistical viewpoint (II)

• For a given point x we can assume that the outputs are random variables defined by:

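$$p(\mathbf{y} \mid \mathbf{x}) = \int_{\forall \mathbf{z} \,/\, g(\mathbf{x}, \mathbf{z}) = \mathbf{y}} p(\mathbf{z} \mid \mathbf{x})\, d\mathbf{z}$$

$$y_i = \phi_i(x_1, \dots, x_p) + \varepsilon_i, \quad i = 1, \dots, q$$

[Figure: $y$ versus $x$, showing $\phi(x)$ and the noisy outputs $\phi(x) + \varepsilon$]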


Overfitting and oversmoothing

[Figure: panels (a) to (d) showing oversmoothing and overfitting for a classification problem and for a regression problem]

• Learning algorithms must avoid being trapped by the overfitting and oversmoothing problems
– Oversmoothing: typical in “conservative” algorithms
– Overfitting: typical in “risky” algorithms


The bias-variance trade-off: Idea (I)

• We are playing darts
– The dart player's objective is to hit the bull's-eye
– Imagine that the player is blindfolded
– The player has to estimate where the bull's-eye is before throwing the dart

[Figure: true underlying function (unknown), learning set 1, learning set 2, model]


The bias-variance trade-off: Idea (II)

[Figure: target, realizations, bias and variance]

$$\text{Error} = \text{bias}^2 + \text{variance}$$

• The dart player's error can be decomposed into two components:
– Systematic error (bias)
– Random error (variance)

• Several realizations (for the same true target) are used to measure the dart player's accuracy
• In practice, the bias and variance cannot be easily estimated


The bias-variance trade-off: Example I

• Model: running-line smoother
– Pass over the data: for each example $x_e$, take the window of $2w + 1$ points from $x_{e-w}$ to $x_{e+w}$
– Fit an (auxiliary) straight-line model to the window (averaging process) and evaluate it at $x_e$:

$$s_e = \mathrm{ave}_w(\bullet) = a x_e + b$$

• Problem:
– 100 learning sets
– 100 examples each

[Figure: true function and two learning sets, LS1 (N=100) and LS2 (N=100); $y$ versus $x$ on $[-4, 4]$]
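A minimal sketch of a running-line smoother, under stated assumptions: the inputs are sorted, the window is simply truncated at both ends of the data, and the local line is fitted by ordinary least squares. The function name `running_line` is illustrative.

```python
def running_line(xs, ys, w):
    """Running-line smoother.

    For each x_e, fit a least-squares line a*x + b to the window of up to
    2w + 1 neighbouring points and evaluate it at x_e.
    Assumes xs is sorted in increasing order; windows shrink at the edges.
    """
    n = len(xs)
    smooth = []
    for e in range(n):
        lo, hi = max(0, e - w), min(n, e + w + 1)
        wx, wy = xs[lo:hi], ys[lo:hi]
        mx = sum(wx) / len(wx)
        my = sum(wy) / len(wy)
        sxx = sum((x - mx) ** 2 for x in wx)
        sxy = sum((x - mx) * (y - my) for x, y in zip(wx, wy))
        a = sxy / sxx if sxx > 0 else 0.0   # slope of the local line
        b = my - a * mx                     # intercept of the local line
        smooth.append(a * xs[e] + b)
    return smooth

# Sanity check on noise-free linear data: the smoother reproduces the line.
xs = [0.1 * i for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]
fitted = running_line(xs, ys, w=5)
```

A small `w` gives a flexible model that follows the data closely; a large `w` gives a rigid one.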


The bias-variance trade-off: Example I

[Figure: (a) true function and models (w=1); (b) true function and models (w=30); $y$ versus $x$ on $[-4, 4]$]

[Figure: (a) true function and average model (w=1); (b) true function and average model (w=30); $y$ versus $x$ on $[-4, 4]$]

• Too rigid models: small variance, high bias (oversmoothing)

• Too flexible models: high variance, small bias (overfitting)


The bias-variance trade-off: Example I

[Figure: squared bias, variance and MSE versus $w$ (0 to 30)]

$$\text{MSE} = \text{bias}^2 + \text{variance}$$

• Min. MSE (validation set): good compromise between bias and variance
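The decomposition can be checked numerically by simulating many learning sets. The sketch below swaps in a running-mean smoother for the running-line smoother to stay short, and it assumes a Gaussian-shaped target, a noise level of 0.1 and a fixed grid of inputs on $[-4, 4]$; none of these values come from the slides.

```python
import math
import random

def true_f(x):
    # Assumed target function, chosen only for illustration.
    return math.exp(-x * x / 2.0)

def running_mean(ys, w):
    """Average of the up-to-(2w + 1) neighbours of each point."""
    n = len(ys)
    out = []
    for e in range(n):
        window = ys[max(0, e - w):min(n, e + w + 1)]
        out.append(sum(window) / len(window))
    return out

def bias_variance(w, n_sets=100, n=100, noise=0.1, seed=0):
    """Estimate squared bias, variance and MSE over n_sets learning sets."""
    rng = random.Random(seed)
    xs = [-4.0 + 8.0 * i / (n - 1) for i in range(n)]
    fits = []
    for _ in range(n_sets):
        ys = [true_f(x) + rng.gauss(0.0, noise) for x in xs]
        fits.append(running_mean(ys, w))
    # Average model over all learning sets, point by point.
    mean_fit = [sum(f[i] for f in fits) / n_sets for i in range(n)]
    bias2 = sum((mean_fit[i] - true_f(xs[i])) ** 2 for i in range(n)) / n
    var = sum((f[i] - mean_fit[i]) ** 2
              for f in fits for i in range(n)) / (n_sets * n)
    mse = sum((f[i] - true_f(xs[i])) ** 2
              for f in fits for i in range(n)) / (n_sets * n)
    return bias2, var, mse

for w in (1, 5, 30):
    b2, v, m = bias_variance(w)
    print(f"w={w:2d}  bias^2={b2:.5f}  variance={v:.5f}  MSE={m:.5f}")
```

Here the MSE is measured against the noise-free target, so it equals the squared bias plus the variance with no extra noise term.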


The bias-variance trade-off: Example II

• Straight line: large MSE (mostly bias)

• 2nd degree polynomial: smallest MSE

• 100 learning sets
– 500 examples each


Curse of dimensionality (I)

[Figure: N=1000 points generated uniformly for p=1, p=2 and p=3]

• Same number of points, uniformly generated on $[0,1]^p$

• This curse replaces the geometrical intuition gained from low-dimensional spaces with surprising and unexpected properties of high-dimensional ones


Curse of dimensionality (II)

• Maintaining the sampling density:

[Figure: sampling density versus N (1 to 1e10, logarithmic axis) for p=1, p=2, p=3, p=4 and p=5]

• High-dimensional spaces are always very sparse
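A short sketch of the arithmetic behind the plot, assuming that "sampling density" means the number of points per unit length along each axis when N points are spread uniformly over $[0,1]^p$, i.e. $N^{1/p}$. The function names are illustrative.

```python
def sampling_density(n_points, p):
    """Points per unit length along each axis for n_points in [0, 1]^p."""
    return n_points ** (1.0 / p)

def points_needed(density, p):
    """Number of points required to keep a given per-axis density in p dimensions."""
    return density ** p

for p in range(1, 6):
    print(p, round(sampling_density(10**10, p), 1), points_needed(100, p))
```

Under this definition, keeping the density that 100 points give in one dimension takes $100^5 = 10^{10}$ points in five dimensions.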


Supervised models

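• Two main parts:
– Internal structure (set of possible functions)
– Parameters (select the function)

• Example: polynomial of degree $M$

$$y = f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_M x^M$$

– Parameters: $\{\beta_0, \dots, \beta_M\}$
– No. of parameters (structure): $M$ (e.g. $M = 2$)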
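The distinction between structure and parameters can be made concrete with a small sketch: the length of the coefficient list fixes the structure (the degree $M$), and the coefficient values pick one function from that family. The helper name `polynomial` is an assumption made for this example.

```python
def polynomial(betas):
    """Return f(x) = beta_0 + beta_1*x + ... + beta_M*x^M.

    The structure is the degree M = len(betas) - 1; the parameters are the
    values in betas. Evaluation uses Horner's rule.
    """
    def f(x):
        y = 0.0
        for beta in reversed(betas):
            y = y * x + beta
        return y
    return f

# Structure M = 2, parameters (1, 0, 2): f(x) = 1 + 2x^2.
f = polynomial([1.0, 0.0, 2.0])
print(f(3.0))
```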


Learning strategies

• To be successful, several decisions need to be made correctly:

[Flowchart]
1. Select type of model (human decision)
2. Select internal structure of the model (human decision or automatic)
3. Adjust parameters of the model (automatic)
4. Try with another structure? If yes, return to step 2
5. Try with another type? If yes, return to step 1


Dealing with high dimensions

$$y = f(\mathbf{x})$$

• Standard approach:
– Divide/combine and conquer via additive models
– Weighted sum of basis functions $B_j(\mathbf{x})$:

$$f(\mathbf{x}) = \sum_{j=0}^{M} \beta_j B_j(\mathbf{x})$$
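A minimal sketch of an additive model as a weighted sum of basis functions, with the weights $\beta_j$ fitted by least squares. The fitting method (normal equations solved by Gauss-Jordan elimination) is an assumption made for illustration; the slides do not prescribe one, and the function names are hypothetical.

```python
def additive_model(betas, basis):
    """f(x) = sum_j beta_j * B_j(x)."""
    return lambda x: sum(b * B(x) for b, B in zip(betas, basis))

def fit_betas(basis, xs, ys):
    """Least-squares weights via the normal equations (A^T A) beta = A^T y."""
    m = len(basis)
    design = [[B(x) for B in basis] for x in xs]
    # Augmented matrix [A^T A | A^T y].
    aug = [[sum(r[i] * r[j] for r in design) for j in range(m)]
           + [sum(r[i] * y for r, y in zip(design, ys))]
           for i in range(m)]
    for c in range(m):
        # Partial pivoting, then eliminate column c from every other row.
        piv = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(m):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

# Three basis functions: B_0 = 1, B_1 = x, B_2 = x^2.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
betas = fit_betas(basis, xs, ys)
model = additive_model(betas, basis)
```

Once the basis functions are fixed, the model is linear in the $\beta_j$, which is why a plain least-squares solve is enough in this sketch.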


Multidimensional models: strategies

• Approaches
– Partitioning the input space: classification and regression trees
– Projecting: MLPs
– Using norms: RBFNs

$$y = f(\mathbf{x}) = \sum_{j=0}^{M} \beta_j B_j(\mathbf{x})$$


Partitioning the input space (I)

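• Membership function:

$$\mu_R(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} \in R \\ 0 & \text{otherwise} \end{cases}$$

• Using axis-oriented membership functions:

$$\mu_R(\mathbf{x}) = \prod_{u=1}^{p} \mu_{[a_u, b_u]}(x_u)$$

[Figure: rectangular region $R$ in the $(x_1, x_2)$ plane, with $\mu_R(\mathbf{x})$ built from $\mu_{[a_1, b_1]}$ and $\mu_{[a_2, b_2]}$]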
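The product of axis-oriented membership functions can be sketched in a few lines. Treating the intervals as closed, $[a_u, b_u]$, is an assumption, and the function names are illustrative.

```python
def interval_membership(a, b):
    """Crisp membership of a scalar in the interval [a, b]."""
    return lambda x: 1.0 if a <= x <= b else 0.0

def box_membership(bounds):
    """Membership of a vector in the axis-oriented region R.

    bounds is a list of (a_u, b_u) pairs, one per input variable; the result
    is the product of the one-dimensional memberships.
    """
    parts = [interval_membership(a, b) for a, b in bounds]
    def mu(x):
        result = 1.0
        for part, x_u in zip(parts, x):
            result *= part(x_u)
        return result
    return mu

# Region R = [0, 1] x [2, 3] in a two-dimensional input space.
mu_R = box_membership([(0.0, 1.0), (2.0, 3.0)])
print(mu_R((0.5, 2.5)), mu_R((0.5, 3.5)))
```

A point belongs to the region only if every coordinate lies in its interval, since a single zero factor makes the whole product zero.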


Partitioning the input space (II)

• Crisp vs fuzzy partitioning:

[Figure: membership functions $\mu_R(x)$ (between 0 and 1) over the regions very_low, low, medium, high and very_high: (a) crisp partitioning, with non-overlapping regions; (b) fuzzy partitioning]


Partitioning the input space (III)

• Example: Regression tree

[Figure (a): model, true function and scatter plot of the learning set, $y$ versus $x$]

[Figure (b): basis functions 0.255*B2, 0.644*B4, 0.968*B6 and 0.734*B7, $y$ versus $x$]

Learning set: parabola_noise.tr, N=81
Algorithm: Regression, Var_min: 0.012
Number_of_nodes=19 (Number_of_basis_functions=10)

[Figure: regression tree; each internal node tests $x_1$ against a threshold, each leaf holds a constant output; the number of examples reaching each node is given in parentheses]

x1 < -2.45? (81)
  Y: x1 < -3.25? (16)
    Y: x1 < -3.55? (8)
      Y: 0.030 (5)
      N: 0.255 (3)
    N: 0.467 (8)
  N: x1 < 2.65? (65)
    Y: x1 < -1.95? (51)
      Y: 0.644 (5)
      N: x1 < 1.75? (46)
        Y: x1 < -1.15? (37)
          Y: 0.848 (8)
          N: 0.968 (29)
        N: 0.734 (9)
    N: x1 < 3.45? (14)
      Y: 0.403 (8)
      N: x1 < 3.95? (6)
        Y: 0.165 (5)
        N: -0.199 (1)

$$y = f(\mathbf{x}) = \sum_{j=0}^{M} \beta_j B_j(\mathbf{x})$$
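The tree in this example can be read as a piecewise-constant additive model: each leaf is a basis function $B_j$ (the crisp membership of one interval of $x_1$) and its constant output is the weight $\beta_j$. The sketch below uses the split thresholds and leaf values as read from the tree figure, ordered from left to right; that reading of the figure, and the function name, are assumptions.

```python
import bisect

# Split thresholds on x1, sorted, as read from the tree figure.
THRESHOLDS = [-3.55, -3.25, -2.45, -1.95, -1.15, 1.75, 2.65, 3.45, 3.95]
# Leaf outputs beta_1 .. beta_10, one per interval, from left to right.
BETAS = [0.030, 0.255, 0.467, 0.644, 0.848, 0.968, 0.734, 0.403, 0.165, -0.199]

def tree_predict(x1):
    """Output of the leaf whose interval contains x1.

    A test "x1 < t?" sends x1 == t to the right, which matches bisect_right:
    it counts the thresholds that are <= x1.
    """
    return BETAS[bisect.bisect_right(THRESHOLDS, x1)]

print(tree_predict(0.0), tree_predict(-2.0), tree_predict(4.0))
```

Exactly one basis function is active for any input, so the weighted sum reduces to a table lookup.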


Projecting (I)

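• Inner product:

$$\mathbf{u}^T\mathbf{v} = \mathbf{v}^T\mathbf{u} = \sum_{k=1}^{p} u_k v_k$$

$$\cos\theta = \frac{\mathbf{u}^T\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$$

• To obtain the standard projection of $\mathbf{v}$ along $\mathbf{u}$, the inner product must be normalized (dividing by the norm $\|\mathbf{u}\|$)

[Figure: vectors $\mathbf{u}$ and $\mathbf{v}$ separated by the angle $\theta$, with the projections $\mathbf{u}^T\mathbf{v} / \|\mathbf{u}\|$ and $\mathbf{u}^T\mathbf{v} / \|\mathbf{v}\|$]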
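The inner product, the angle and the normalized projection can be sketched directly from the formulas above. The function names are illustrative.

```python
import math

def inner(u, v):
    """Inner product u^T v = sum_k u_k * v_k."""
    return sum(uk * vk for uk, vk in zip(u, v))

def norm(u):
    """Euclidean norm of u."""
    return math.sqrt(inner(u, u))

def cos_angle(u, v):
    """Cosine of the angle between u and v."""
    return inner(u, v) / (norm(u) * norm(v))

def projection_along(v, u):
    """Length of the projection of v along u: u^T v / ||u||."""
    return inner(u, v) / norm(u)

u = (3.0, 4.0)
v = (1.0, 0.0)
print(inner(u, v), cos_angle(u, v), projection_along(v, u), projection_along(u, v))
```

The inner product is symmetric, but the projection is not: dividing by $\|\mathbf{u}\|$ or by $\|\mathbf{v}\|$ gives different lengths unless both norms are equal.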


Projecting (II)

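• Example: Straight hyperplane

$$y = f(\mathbf{x}) = \boldsymbol{\alpha}^T\mathbf{x} + \alpha_0, \quad \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_p)$$

– $\|\boldsymbol{\alpha}\|$: controls the slope
– $\boldsymbol{\alpha}$: controls the orientation
– $\alpha_0$: mean value

[Figure: hyperplane for $p = 2$ and its neural network representation, with inputs $x_1$, $x_2$ and a constant input weighted by $\alpha_1$, $\alpha_2$ and $\alpha_0$]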


Sigmoid-like basis functions (I)

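• S-shaped functions:

$$B_j(\mathbf{x}) = f(-\alpha_0 + \boldsymbol{\alpha}^T\mathbf{x}) = f(\eta) = \frac{1}{1 + \exp(-\eta)}, \quad \eta = -\alpha_0 + \boldsymbol{\alpha}^T\mathbf{x}$$

$$\mathbf{u}^T\mathbf{v} = \mathbf{v}^T\mathbf{u} = \sum_{k=1}^{p} u_k v_k$$

[Figure (a): sigmoid $f(-\alpha_0 + \boldsymbol{\alpha}^T\mathbf{x})$ over $(x_1, x_2)$ with its contour lines; $(\alpha_1, \alpha_2)$ sets the orientation and $\alpha_0$ the position]

[Figure (b): neural network representation, with inputs $x_1$, $x_2$ and the constant input $-1$ weighted by $\alpha_1$, $\alpha_2$ and $\alpha_0$]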
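A sigmoid-like basis function projects the input onto $\boldsymbol{\alpha}$, shifts it by $\alpha_0$ and passes the result through the logistic function. The sketch below follows that definition; the two-branch evaluation of the logistic function is an implementation choice added here to avoid overflow for large negative $\eta$.

```python
import math

def logistic(eta):
    """1 / (1 + exp(-eta)), evaluated without overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def sigmoid_basis(alpha, alpha0):
    """B(x) = f(-alpha0 + alpha^T x) with f the logistic function."""
    def B(x):
        eta = -alpha0 + sum(a * xk for a, xk in zip(alpha, x))
        return logistic(eta)
    return B

# Orientation (1, 1), position alpha0 = 2: the sigmoid is centred on the
# line x1 + x2 = 2, where it takes the value 0.5.
B = sigmoid_basis((1.0, 1.0), 2.0)
print(B((1.0, 1.0)), B((3.0, 3.0)))
```

Scaling $\boldsymbol{\alpha}$ and $\alpha_0$ by the same factor keeps the centre line where it is and only makes the transition steeper.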


Sigmoid: varying the orientation

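$$B_j(\mathbf{x}) = f(-\alpha_0 + \boldsymbol{\alpha}^T\mathbf{x}) = f(\eta) = \frac{1}{1 + \exp(-\eta)}, \quad \eta = -\alpha_0 + \boldsymbol{\alpha}^T\mathbf{x}$$

• $(\alpha_1, \alpha_2)$: controls the orientation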


Sigmoid: varying the slope

• $\|\boldsymbol{\alpha}\|$: controls the slope

[Figure: sigmoids for increasing $\|\boldsymbol{\alpha}\|$]


Sigmoid: varying the position

• $\alpha_0 / \|\boldsymbol{\alpha}\|$: center of the sigmoid

[Figure: sigmoids for increasing $\alpha_0$ ($\boldsymbol{\alpha}$ fixed)]


Norms

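• The (squared) Euclidean distance:

$$\|\mathbf{u} - \mathbf{v}\|^2 = \sum_{k=1}^{p} (u_k - v_k)^2 = (\mathbf{u} - \mathbf{v})^T (\mathbf{u} - \mathbf{v})$$

• Weighted norms

$$\|\mathbf{u} - \mathbf{v}\|_{\mathbf{W}}^2 = (\mathbf{u} - \mathbf{v})^T \mathbf{W} (\mathbf{u} - \mathbf{v})$$

• Elliptical norms

$$\mathbf{W} = \begin{bmatrix} 1/\sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & 1/\sigma_p^2 \end{bmatrix}$$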
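For a diagonal weight matrix, the weighted norm reduces to a sum of squared differences, each scaled by its own $\sigma_k$. A minimal sketch, with illustrative function names:

```python
def squared_euclidean(u, v):
    """||u - v||^2 = sum_k (u_k - v_k)^2."""
    return sum((uk - vk) ** 2 for uk, vk in zip(u, v))

def squared_elliptical(u, v, sigmas):
    """||u - v||_W^2 with W = diag(1/sigma_1^2, ..., 1/sigma_p^2)."""
    return sum(((uk - vk) / s) ** 2 for uk, vk, s in zip(u, v, sigmas))

u = (1.0, 2.0)
v = (4.0, 6.0)
print(squared_euclidean(u, v), squared_elliptical(u, v, (3.0, 2.0)))
```

With all $\sigma_k = 1$ the elliptical norm coincides with the Euclidean one; unequal values stretch the circular contour lines into axis-aligned ellipses.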


Gaussian-like Basis Functions (III)

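$$\|\mathbf{u} - \mathbf{v}\|_{\mathbf{W}}^2 = (\mathbf{u} - \mathbf{v})^T \mathbf{W} (\mathbf{u} - \mathbf{v})$$

• One-dimensional:

$$f(x) = \exp\left( -\frac{(x - \varsigma)^2}{2\sigma^2} \right)$$

– $\varsigma$: center (position)
– $\sigma$: spread

[Figure (a): Gaussian around its center for sigma=0.1, sigma=0.2 and sigma=0.5]

• Two-dimensional:

$$f(\mathbf{x}) = \exp\left( -\begin{pmatrix} x_1 - \varsigma_1 \\ x_2 - \varsigma_2 \end{pmatrix}^T \begin{bmatrix} 1/\sigma_1^2 & 0 \\ 0 & 1/\sigma_2^2 \end{bmatrix} \begin{pmatrix} x_1 - \varsigma_1 \\ x_2 - \varsigma_2 \end{pmatrix} \right)$$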
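A minimal sketch of the one-dimensional Gaussian basis function, using the form $\exp(-(x - \varsigma)^2 / (2\sigma^2))$ with center $\varsigma$ and spread $\sigma$. The function name and the example values are assumptions.

```python
import math

def gaussian_basis(center, sigma):
    """B(x) = exp(-(x - center)^2 / (2 * sigma^2))."""
    def B(x):
        return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))
    return B

# Center 1.0, spread 0.5: the value is 1 at the center and decays
# symmetrically on both sides.
B = gaussian_basis(1.0, 0.5)
print(B(1.0), B(1.5), B(0.5))
```

Moving `center` shifts the bump along the axis, while a larger `sigma` widens it without changing its height.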


Gaussian: varying the position


Gaussian: varying the orientation


Gaussian: varying the spread (both dir.)


Gaussian: varying the spread (one dir.)

[Figure: Gaussians for increasing $\sigma_1$]


Norms and projections (I)

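[Figure (a): Euclidean distance $\|\mathbf{x} - \mathbf{z}\|$; contour lines in the $(x_1, x_2)$ plane around $\mathbf{z}$, with the points A, B and C]

[Figure (b): inner product $\mathbf{z}^T\mathbf{x}$; contour lines in the $(x_1, x_2)$ plane, with the points A, B and C]

[Figure: vectors $\mathbf{u}$ and $\mathbf{v}$ separated by the angle $\theta$, with the projections $\mathbf{u}^T\mathbf{v} / \|\mathbf{u}\|$ and $\mathbf{u}^T\mathbf{v} / \|\mathbf{v}\|$ and the difference $\mathbf{u} - \mathbf{v}$]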


Norms and projections (II)

• They are mathematical rules which provide
– a particular grouping of the data in the input space (clusters)
– a ranking of these clusters

• All points within a particular cluster have the same index. These indexes provide an order relationship between clusters.

[Figure: multidimensional input space → clustering → index assignment → one-dimensional space]


Taxonomy of models: criteria

• Number of input variables
– One-dimensional models (e.g. straight line)
– Multidimensional models (e.g. hyperplane)

• Complexity
– Basic models (e.g. Gaussian)
– Sophisticated models (e.g. RBFN)

• Internal structure of the model
– Structured models (e.g. linear regression)
– Unstructured models (e.g. K-NN)
– Hybrid


Taxonomy of one-dimensional models

One-dimensional models

• Structured
– Basic: constant, straight line, sigmoid-like, Gaussian-like, ...
– Sophisticated: polynomials, splines, wavelets, hinges, ...

• Unstructured
– Basic: running-mean, running-line, ...
– Sophisticated: supersmoother, ...


Taxonomy of multidimensional models

Multidimensional models

• Structured
– Basic: constant, straight hyperplane, sigmoid-like, Gaussian-like, hinges, ...
– Sophisticated: RBFN, MLP, CART, fuzzy trees, MARS, ORTHO & OBLIQUE, ...

• Hybrid: SMART

• Unstructured
– Basic: standard K-NN, ...
– Sophisticated: Machete, DT GANN, ...


Example: Sophisticated structured models

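• Radial Basis Function neural networks (RBFNs):
– Gaussian-like BFs
– Built on the Euclidean distance

• MultiLayer Perceptrons (MLPs):
– Sigmoid-like BFs
– Built on the inner product

$$y = f(\mathbf{x}) = \sum_{j=0}^{M} \beta_j B_j(\mathbf{x})$$

[Figure: network with an input layer $x_1, x_2, \dots, x_k, \dots, x_p$, a hidden layer of basis functions $B_0, B_1, \dots, B_j, \dots, B_M$ and an output layer that combines them with the weights $\beta_0, \beta_1, \dots, \beta_j, \dots, \beta_M$]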


Summary

• The learning problem is difficult to solve in general
• Be careful when estimating the generalization capability of a model
• Multidimensional models (like buildings) are made of small, simple pieces
• Norms and projections are powerful mathematical tools for building models
• The proposed taxonomy provides a rational approach to automatic learning models


Expected future research

[Figure: interpretability (low to high) versus precision (low to high)]

• Artificial neural networks: very accurate models, but still black boxes
• Decision trees: very comprehensible models, but not too accurate
• Goal for future models: very comprehensible and accurate models?


SUPERVISED AUTOMATIC LEARNING MODELS:

A NEW PERSPECTIVE

Eugenio Fco. Sánchez Úbeda
