Lecture 2 Math


TRANSCRIPT

  • Slide 1/34

    Neural Networks and Learning Systems

    TBMI 26

    Ola [email protected]

    Lecture 2

    Supervised learning: Linear models

  • Slide 2/34

    Recap - Supervised learning

    Task: Learn to predict/classify new data from

    labeled examples.

    Input: Training data examples {x_i, y_i}, i = 1...N, where x_i is a feature
    vector and y_i is a class label in the set W. Today we'll mostly assume
    two classes: W = {-1, 1}.

    Output: A function f(x; w_1, ..., w_k) that maps a feature vector x to a
    class label in W.

    Find a function f and adjust the parameters w_1, ..., w_k so that
    new feature vectors are classified correctly. Generalization!!

  • Slide 3/34

    How does the brain make decisions? (On the low level!)

    Basic unit: the neuron

    The human brain has approximately 100 billion (10^11) neurons.

    Each neuron is connected to about 7,000 other neurons.

    Approx. 10^14 to 10^15 synapses (connections).

    [Figure: a neuron, with cell body, dendrites, and axon.]

  • Slide 4/34

    Model of a neuron

    [Figure: model of a neuron. Input signals x_1, ..., x_n are multiplied by
    weights w_1, ..., w_n, summed (Σ), and passed through an activation function.]

  • Slide 5/34

    The Perceptron


    (McCulloch & Pitts 1943, Rosenblatt 1962)

    f(\mathbf{x}; w_0, w_1, \dots, w_n) = \sigma\left(w_0 + \sum_{i=1}^{n} w_i x_i\right) = \sigma(\mathbf{w}^T \mathbf{x})

    Extra reading on the history of the perceptron:

    http://www.csulb.edu/~cwallis/artificialn/History.htm

    Three common activation functions σ:

    \sigma(x) = \mathrm{sign}(x)   (step function; not differentiable!)

    \sigma(x) = x                  (linear function; not biologically plausible!)

    \sigma(x) = \tanh(x)           (sigmoid function)

    [Figure: plots of the step and sigmoid functions, both ranging from -1 to 1.]
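    Not from the original slides: a minimal NumPy sketch of this neuron model,
    with my own (hypothetical) function and variable names, just to make the
    formula above concrete.

        import numpy as np

        def perceptron(x, w, w0, sigma=np.sign):
            # f(x; w0, ..., wn) = sigma(w0 + sum_i wi * xi) = sigma(w^T x + w0)
            return sigma(w0 + np.dot(w, x))

        # The three activation functions discussed above:
        step    = np.sign            # step function (not differentiable at 0)
        linear  = lambda s: s        # linear function (identity)
        sigmoid = np.tanh            # sigmoid-type function

        x = np.array([0.5, -1.2])    # example feature vector
        w = np.array([2.0,  1.0])    # example weights
        print(perceptron(x, w, w0=0.3, sigma=sigmoid))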

  • Slide 6/34

    Notational simplification: Bias weight

    f(\mathbf{x}; w_0, \dots, w_n) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right) = \sigma(\mathbf{w}^T \mathbf{x})

    Add a constant 1 to the feature vector so that we don't have to treat w_0 separately.

    Instead of x = [x_1, ..., x_n]^T, we use x = [1, x_1, ..., x_n]^T.
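    As a small illustration (not from the slides), the bias trick can be
    implemented by prepending a column of ones to the data matrix; the data
    below is made up.

        import numpy as np

        X = np.array([[0.5, -1.2],          # N x n matrix: one feature vector per row
                      [2.0,  0.7]])

        # Prepend a constant 1 to every feature vector: x = [1, x1, ..., xn]^T
        X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

        # Now w = [w0, w1, ..., wn] and the classifier output is sigma(X_aug @ w)
        w = np.array([0.3, 2.0, 1.0])
        print(np.sign(X_aug @ w))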

  • Slide 7/34

    Activation functions in 2D

    \sigma(\mathbf{w}^T \mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x}), \qquad \sigma(\mathbf{w}^T \mathbf{x}) = \mathbf{w}^T \mathbf{x}, \qquad \sigma(\mathbf{w}^T \mathbf{x}) = \tanh(\mathbf{w}^T \mathbf{x})

    [Figure: surface plots of these three activations over a 2D input space.]

  • Slide 8/34

    Geometry of linear classifiers

    [Figure: in the (x_1, x_2) plane, the line y = w_0 + w_1 x_1 + w_2 x_2 = 0
    separates the region y > 0 from the region y < 0; the vector (w_1, w_2)^T
    points into the region y > 0.]

  • Slide 9/34

    Another view: the super feature

    \mathbf{w}^T \mathbf{x} = \sum_{i=0}^{n} w_i x_i

    Reduces the original n features into one single feature.

    If seen as vectors, w^T x is the projection of x onto w.

    [Figure: a feature vector x and its projection w^T x onto the weight vector
    w = (w_1, w_2)^T.]
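    A short numerical illustration (my own, not from the slides) of the
    projection view: thresholding the super feature w^T x is equivalent to
    thresholding the projection of x onto w.

        import numpy as np

        w = np.array([2.0, 1.0])
        x = np.array([1.0, 3.0])

        s = w @ x                          # the super feature w^T x (one scalar)
        proj = (s / (w @ w)) * w           # vector projection of x onto w
        length = s / np.linalg.norm(w)     # signed length of that projection

        # s and length differ only by the constant factor ||w||, so comparing
        # w^T x against a threshold is the same as comparing the projection
        # of x onto w against a (rescaled) threshold.
        print(s, proj, length)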

  • Slide 10/34

    Advantages of a parametric model

    Only stores a few parameters (w_0, w_1, ..., w_n)
    instead of all the training samples, as in k-NN.

    Very fast to evaluate which side of the line a new sample is on,
    i.e., whether w^T x ≷ 0.

    f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} is also known as a discriminant function.

  • Slide 11/34

    Which linear classifier to choose?

    [Figure: a two-class data set in the (x_1, x_2) plane with a separating line
    y = w_0 + w_1 x_1 + w_2 x_2 = 0 and the regions y > 0 and y < 0.]

  • Slide 12/34

    Find the best separator

    Optimization!

    Min/max of a cost function ε(w_0, w_1, ..., w_n)
    with the weights w_0, w_1, ..., w_n as parameters.

    Two ways to optimize:

    Algebraic: Set the derivative to zero and solve:

    \frac{\partial \varepsilon}{\partial w_i} = 0

    Iterative numeric: Follow the gradient direction
    until a minimum/maximum of the cost function is reached.

    This is called gradient descent/ascent!

  • Slide 13/34

    Gradient descent/ascent

    How to get to the lowest point?

    \nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[2mm] \dfrac{\partial f}{\partial x_2} \end{bmatrix}

    [Figure: a 2D cost surface with a question mark at the current position.]

  • Slide 14/34

    Gradient descent

    Choosing the step length h:

    [Figure: gradient descent with a small h, a large h, and a too large h
    (overshooting the minimum); a final panel shows gradient ascent.]
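    Not part of the slides: a minimal sketch of fixed-step gradient descent. The
    quadratic cost and all names here are made-up examples; gradient ascent
    would add the step instead of subtracting it.

        import numpy as np

        def gradient_descent(grad, w_start, h=0.1, n_iter=100):
            # Follow the negative gradient with a fixed step length h.
            w = np.array(w_start, dtype=float)
            for _ in range(n_iter):
                w = w - h * grad(w)        # gradient ascent: w = w + h * grad(w)
            return w

        # Example cost e(w) = (w1 - 3)^2 + 2*(w2 + 1)^2, gradient [2(w1-3), 4(w2+1)]
        grad = lambda w: np.array([2 * (w[0] - 3), 4 * (w[1] + 1)])
        print(gradient_descent(grad, w_start=[0.0, 0.0], h=0.1))   # approaches [3, -1]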

  • Slide 15/34

    Choosing the step length

    The size of the gradient is not relevant!

    [Figure: a cost function ε(w). With a fixed step length, a small derivative
    gives small steps and a large derivative gives large steps.]

  • Slide 16/34

    Local optima

    Gradient search is not guaranteed to find the

    global minimum/maximum.

    With a sufficiently small step length, the

    closest local optimum will be found.

    [Figure: a cost function with several local minima and one global minimum.]

  • Slide 17/34

    Online learning vs. batch learning

    Batch learning: Use all training examples to
    update the classifier.

    Most often used.

    Online learning: Update the classifier using
    one training example at a time.

    Can be used when training samples arrive
    sequentially, e.g., adaptive learning.

    Also known as stochastic ascent/descent (see the sketch below).
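    To make the distinction concrete, here is a small sketch (not from the
    slides) contrasting the two update schemes; the per-sample squared-error
    gradient is just a placeholder for whatever cost is being optimized.

        import numpy as np

        def grad_i(w, x, y):
            # Placeholder per-sample gradient: d/dw of (w^T x - y)^2 for one sample.
            return 2 * (w @ x - y) * x

        def batch_update(w, X, Y, h):
            # Batch learning: accumulate the gradient over ALL samples, then step once.
            g = sum(grad_i(w, x, y) for x, y in zip(X, Y))
            return w - h * g

        def online_update(w, X, Y, h):
            # Online (stochastic) learning: take one step per training sample.
            for x, y in zip(X, Y):
                w = w - h * grad_i(w, x, y)
            return w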

  • Slide 18/34

    Three different cost functions

    Perceptron algorithm

    Linear Discriminant Analysis (a.k.a. Fisher Linear Discriminant)

    Support Vector Machines


  • Slide 19/34

    Perceptron algorithm

    Maximize the following cost function (linear activation function):

    \varepsilon(w_0, w_1, \dots, w_n) = \sum_{i \in I} y_i \, \mathbf{w}^T \mathbf{x}_i

    n = number of features
    I = set of misclassified training samples
    y_i ∈ {-1, 1} depending on the class of training sample i

    The cost considers only misclassified samples, and y_i w^T x_i is negative
    for a misclassified sample, so ε(w_0, w_1, ..., w_n) is always negative,
    or 0 for a perfect classification!

    Historically (1960s) one of the first machine learning algorithms.

  • Slide 20/34

    Perceptron algorithm, cont.

    \varepsilon(w_0, \dots, w_n) = \sum_{i \in I} y_i \, \mathbf{w}^T \mathbf{x}_i
    \qquad\Rightarrow\qquad
    \frac{\partial \varepsilon}{\partial w_k} = \sum_{i \in I} y_i x_{ik}
    \qquad\Rightarrow\qquad
    \frac{\partial \varepsilon}{\partial \mathbf{w}} = \sum_{i \in I} y_i \mathbf{x}_i

    Gradient ascent (Eq. 1):

    \mathbf{w}_{t+1} = \mathbf{w}_t + \eta \, \frac{\partial \varepsilon}{\partial \mathbf{w}} = \mathbf{w}_t + \eta \sum_{i \in I} y_i \mathbf{x}_i

    Algorithm:

    1. Start with a random w
    2. Iterate Eq. 1 until convergence
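    A minimal NumPy sketch of this training loop (my own code, assuming the
    bias-augmented feature vectors from slide 6 and labels y_i in {-1, +1}):

        import numpy as np

        def train_perceptron(X, Y, eta=0.1, n_iter=100):
            # X: (N, n+1) matrix of bias-augmented feature vectors, Y: (N,) labels in {-1, +1}
            w = np.random.randn(X.shape[1])            # 1. start with a random w
            for _ in range(n_iter):
                I = np.sign(X @ w) != Y                # I = set of misclassified samples
                if not I.any():
                    break                              # all samples correctly classified
                w = w + eta * (Y[I, None] * X[I]).sum(axis=0)   # 2. iterate Eq. 1
            return w

    Replacing the sum over all of I with a single misclassified sample per step
    gives the online (stochastic) variant mentioned on slide 17.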

  • Slide 21/34

    Perceptron example


  • Slide 22/34

    Perceptron algorithm, cont.

    If the classes are linearly separable, the

    perceptron algorithm will converge to a

    solution that separates the classes.

    If the classes are not linearly separable, the

    algorithm will fail.

    Which solution it arrives at depends on the
    starting value of w.

  • Slide 23/34

    Linear Discriminant Analysis (LDA)

    Consider the projected feature vectors.

    [Figure: two classes in the (x_1, x_2) plane, projected onto the direction
    w = (w_1, w_2)^T.]

  • Slide 24/34

    LDA: Different projections

    [Figure: two different projection directions for the same data; the optimal
    projection and the corresponding separating plane.]

  • Slide 25/34

    LDA: Separation

    [Figure: a good projection gives small variance within each projected class
    and a large distance between the projected class means; a poor projection
    gives large variance and a small distance.]

    Goal: minimize variance and maximize distance.

    Maximize:

    \varepsilon = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}

  • Slide 26/34

    LDA: Cost function

    \varepsilon = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2},
    \qquad
    \sigma_1^2 + \sigma_2^2 = \mathbf{w}^T\mathbf{C}_1\mathbf{w} + \mathbf{w}^T\mathbf{C}_2\mathbf{w} = \mathbf{w}^T\mathbf{C}_{\mathrm{tot}}\mathbf{w}

    Projected mean:

    \mu = \frac{1}{n}\sum_{i=1}^{n} \mathbf{w}^T\mathbf{x}_i = \mathbf{w}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) = \mathbf{w}^T\bar{\mathbf{x}}

    Variance:

    \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\bar{\mathbf{x}}\right)^2 = \mathbf{w}^T\left[\frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{C}\mathbf{w}

    Distance:

    (\mu_1-\mu_2)^2 = \left(\mathbf{w}^T\bar{\mathbf{x}}_1 - \mathbf{w}^T\bar{\mathbf{x}}_2\right)^2 = \mathbf{w}^T(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)^T\mathbf{w} = \mathbf{w}^T\mathbf{M}\mathbf{w}

  • Slide 27/34

    LDA: Cost function, cont.

    \varepsilon = \frac{(\mu_1-\mu_2)^2}{\sigma_1^2+\sigma_2^2} = \frac{\mathbf{w}^T\mathbf{M}\mathbf{w}}{\mathbf{w}^T\mathbf{C}_{\mathrm{tot}}\mathbf{w}}

    This form is called a (generalized) Rayleigh quotient, which is maximized by
    the eigenvector with the largest eigenvalue of the generalized eigenvalue
    problem \mathbf{M}\mathbf{w} = \lambda\mathbf{C}_{\mathrm{tot}}\mathbf{w}.

    Simplification:

    \mathbf{M}\mathbf{w} = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)^T\mathbf{w} = K(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2) \quad \text{for some scalar } K

    \Rightarrow \quad \mathbf{w} \sim \mathbf{C}_{\mathrm{tot}}^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)

    Scaling of w is not important!
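    As an illustration only, a NumPy sketch of this closed-form solution; the
    names are mine, and np.cov normalizes by 1/(N-1) rather than the 1/n used
    above, which only rescales w. The classification threshold below (the
    midpoint of the projected class means) is one simple, hypothetical choice;
    the slides do not specify it.

        import numpy as np

        def lda_direction(X1, X2):
            # X1, X2: samples of class 1 and class 2, one feature vector per row.
            m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
            C_tot = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
            w = np.linalg.solve(C_tot, m1 - m2)     # w ~ C_tot^{-1} (mean1 - mean2)
            return w, m1, m2

        def lda_classify(x, w, m1, m2):
            # Threshold the super feature w^T x at the midpoint of the projected means.
            return 1 if w @ x > w @ ((m1 + m2) / 2) else -1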

  • Slide 28/34

    LDA - Examples

    [Figure: two 2D example data sets with LDA applied.]

  • Slide 29/34

    Support Vector Machines (SVM): Idea!

    The optimal separation line remains the same; the feature points
    close to the class limits are more important!

    These are called support vectors!

    [Figure: a two-class data set with the support vectors highlighted.]

  • Slide 30/34

    SVM: Maximum margin

    Choose the w that gives maximum margin!

    [Figure: the separating line with a margin of width 2/\|\mathbf{w}\| and the
    weight vector w.]

  • Slide 31/34

    SVM: Cost function

    [Figure: the separating line w^T x + w_0 = 0 with margin boundaries
    w^T x + w_0 = 1 and w^T x + w_0 = -1; the regions outside the margin satisfy
    w^T x + w_0 > 1 and w^T x + w_0 < -1.]

    Scaling of w is free: choose a specific scaling!

    Let x_p be a point on the separating line (w^T x_p + w_0 = 0) and let
    x_s = x_p + \boldsymbol{\delta} be a support vector on the margin boundary,
    with \boldsymbol{\delta} parallel to w:

    \mathbf{w}^T\mathbf{x}_s + w_0 = \mathbf{w}^T\boldsymbol{\delta} + \mathbf{w}^T\mathbf{x}_p + w_0 = \mathbf{w}^T\boldsymbol{\delta} = \|\boldsymbol{\delta}\|\,\|\mathbf{w}\| = 1

    For the support vector, \|\boldsymbol{\delta}\| = 1/\|\mathbf{w}\|.

  • Slide 32/34

    SVM: Cost function, cont.

    Maximizing \delta = 1/\|\mathbf{w}\| is the same as minimizing \|\mathbf{w}\|!

    \min_{\mathbf{w}} \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + w_0) \geq 1 \ \ \forall i

    No training sample may reside in the margin region!

    The optimization procedure is outside the scope of this course.
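    Not part of the slides: in practice a linear SVM is usually fitted with an
    existing package, for example scikit-learn (assuming it is installed). The
    toy data below is made up, and the large C value only approximates the
    hard-margin formulation above.

        import numpy as np
        from sklearn.svm import SVC

        X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
        y = np.array([1, 1, -1, -1])

        clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C: (almost) no margin violations

        print(clf.coef_, clf.intercept_)               # w and w0 of the separating line
        print(clf.support_vectors_)                    # the support vectors
        print(clf.predict([[0.5, 1.0]]))               # classify a new sample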

  • Slide 33/34

    Summary: Linear classifiers

    Perceptron algorithm:

    Of historical interest

    Linear Discriminant Analysis

    Simple to implement

    Very useful as a first classifier to try

    Support Vector Machines

    By many considered the state-of-the-art classification
    principle, especially the nonlinear forms (see next lecture)

    Many software packages exist on the internet


  • Slide 34/34

    What about more than 2 classes?

    Common solution: Combine several binary

    classifiers

    [Figure: several binary classifiers applied to the same input; evaluate the
    most likely class.]
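    A small sketch (my own, not from the slides) of one common way to combine
    binary classifiers, one-vs-rest: train one linear classifier per class and
    pick the class with the largest discriminant value. The helper train_binary
    stands for any of the binary trainers above and is hypothetical.

        import numpy as np

        def one_vs_rest_train(X, Y, classes, train_binary):
            # Train one binary classifier per class: class k (+1) vs. all others (-1).
            return {k: train_binary(X, np.where(Y == k, 1, -1)) for k in classes}

        def one_vs_rest_predict(x, W):
            # Evaluate the most likely class: the largest discriminant w_k^T x wins.
            return max(W, key=lambda k: W[k] @ x)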