Lecture 2 Math


TRANSCRIPT

  • Slide 1/34

    Neural Networks and Learning Systems

    TBMI 26

    Ola [email protected]

    Lecture 2

    Supervised learning: Linear models

  • Slide 2/34

    Recap - Supervised learning

    Task: Learn to predict/classify new data from

    labeled examples.

    Input: Training data examples {x_i, y_i}, i = 1...N, where x_i is a feature
    vector and y_i is a class label in the set W. Today we'll mostly assume
    two classes: W = {-1, 1}.

    Output: A function f(x; w_1, ..., w_k) that maps a feature vector x to a
    class label in W.

    Find a function f and adjust the parameters w_1, ..., w_k so that
    new feature vectors are classified correctly. Generalization!!

  • Slide 3/34

    How does the brain make decisions? (On the low level!)

    Basic unit: the neuron

    The human brain has approximately 100 billion (10^11) neurons.

    Each neuron is connected to about 7,000 other neurons.

    Approx. 10^14 to 10^15 synapses (connections).

    [Figure: a neuron, with cell body, dendrites, and axon.]

  • Slide 4/34

    Model of a neuron

    [Figure: model of a neuron. Input signals x_1, ..., x_n are multiplied by
    weights w_1, ..., w_n, summed (Σ), and passed through an activation function.]

  • Slide 5/34

    The Perceptron


    (McCulloch & Pitts 1943, Rosenblatt 1962)

    f(\mathbf{x}; w_0, w_1, \dots, w_n) = \sigma\left(w_0 + \sum_{i=1}^{n} w_i x_i\right) = \sigma(\mathbf{w}^T \mathbf{x})

    Extra reading on the history of the perceptron:

    http://www.csulb.edu/~cwallis/artificialn/History.htm

    Three common activation functions σ:

    \sigma(x) = \mathrm{sign}(x)   (step function; not differentiable!)

    \sigma(x) = x                  (linear function; not biologically plausible!)

    \sigma(x) = \tanh(x)           (sigmoid function)

    [Figure: plots of the step and sigmoid functions, both ranging from -1 to 1.]
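    Not from the original slides: a minimal NumPy sketch of this neuron model,
    with my own (hypothetical) function and variable names, just to make the
    formula above concrete.

        import numpy as np

        def perceptron(x, w, w0, sigma=np.sign):
            # f(x; w0, ..., wn) = sigma(w0 + sum_i wi * xi) = sigma(w^T x + w0)
            return sigma(w0 + np.dot(w, x))

        # The three activation functions discussed above:
        step    = np.sign            # step function (not differentiable at 0)
        linear  = lambda s: s        # linear function (identity)
        sigmoid = np.tanh            # sigmoid-type function

        x = np.array([0.5, -1.2])    # example feature vector
        w = np.array([2.0,  1.0])    # example weights
        print(perceptron(x, w, w0=0.3, sigma=sigmoid))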

  • Slide 6/34

    Notational simplification: Bias weight

    f(\mathbf{x}; w_0, \dots, w_n) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right) = \sigma(\mathbf{w}^T \mathbf{x})

    Add a constant 1 to the feature vector so that we don't have to treat w_0 separately.

    Instead of x = [x_1, ..., x_n]^T, we use x = [1, x_1, ..., x_n]^T.
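    As a small illustration (not from the slides), the bias trick can be
    implemented by prepending a column of ones to the data matrix; the data
    below is made up.

        import numpy as np

        X = np.array([[0.5, -1.2],          # N x n matrix: one feature vector per row
                      [2.0,  0.7]])

        # Prepend a constant 1 to every feature vector: x = [1, x1, ..., xn]^T
        X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

        # Now w = [w0, w1, ..., wn] and the classifier output is sigma(X_aug @ w)
        w = np.array([0.3, 2.0, 1.0])
        print(np.sign(X_aug @ w))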

  • Slide 7/34

    Activation functions in 2D

    \sigma(\mathbf{w}^T \mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x}), \qquad \sigma(\mathbf{w}^T \mathbf{x}) = \mathbf{w}^T \mathbf{x}, \qquad \sigma(\mathbf{w}^T \mathbf{x}) = \tanh(\mathbf{w}^T \mathbf{x})

    [Figure: surface plots of these three activations over a 2D input space.]

  • Slide 8/34

    Geometry of linear classifiers

    [Figure: in the (x_1, x_2) plane, the line y = w_0 + w_1 x_1 + w_2 x_2 = 0
    separates the region y > 0 from the region y < 0; the vector (w_1, w_2)^T
    points into the region y > 0.]

  • Slide 9/34

    Another view: the super feature

    \mathbf{w}^T \mathbf{x} = \sum_{i=0}^{n} w_i x_i

    Reduces the original n features into one single feature.

    If seen as vectors, w^T x is the projection of x onto w.

    [Figure: a feature vector x and its projection w^T x onto the weight vector
    w = (w_1, w_2)^T.]
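    A short numerical illustration (my own, not from the slides) of the
    projection view: thresholding the super feature w^T x is equivalent to
    thresholding the projection of x onto w.

        import numpy as np

        w = np.array([2.0, 1.0])
        x = np.array([1.0, 3.0])

        s = w @ x                          # the super feature w^T x (one scalar)
        proj = (s / (w @ w)) * w           # vector projection of x onto w
        length = s / np.linalg.norm(w)     # signed length of that projection

        # s and length differ only by the constant factor ||w||, so comparing
        # w^T x against a threshold is the same as comparing the projection
        # of x onto w against a (rescaled) threshold.
        print(s, proj, length)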

  • Slide 10/34

    Advantages of a parametric model

    Only stores a few parameters (w_0, w_1, ..., w_n)
    instead of all the training samples, as in k-NN.

    Very fast to evaluate which side of the line a new sample is on,
    i.e., whether w^T x ≷ 0.

    f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x} is also known as a discriminant function.

  • Slide 11/34

    Which linear classifier to choose?

    [Figure: a two-class data set in the (x_1, x_2) plane with a separating line
    y = w_0 + w_1 x_1 + w_2 x_2 = 0 and the regions y > 0 and y < 0.]

  • Slide 12/34

    Find the best separator

    Optimization!

    Min/max of a cost function ε(w_0, w_1, ..., w_n)
    with the weights w_0, w_1, ..., w_n as parameters.

    Two ways to optimize:

    Algebraic: Set the derivative to zero and solve:

    \frac{\partial \varepsilon}{\partial w_i} = 0

    Iterative numeric: Follow the gradient direction
    until a minimum/maximum of the cost function is reached.

    This is called gradient descent/ascent!

  • Slide 13/34

    Gradient descent/ascent

    How to get to the lowest point?

    \nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[2mm] \dfrac{\partial f}{\partial x_2} \end{bmatrix}

    [Figure: a 2D cost surface with a question mark at the current position.]

  • Slide 14/34

    Gradient descent

    Choosing the step length h:

    [Figure: gradient descent with a small h, a large h, and a too large h
    (overshooting the minimum); a final panel shows gradient ascent.]
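    Not part of the slides: a minimal sketch of fixed-step gradient descent. The
    quadratic cost and all names here are made-up examples; gradient ascent
    would add the step instead of subtracting it.

        import numpy as np

        def gradient_descent(grad, w_start, h=0.1, n_iter=100):
            # Follow the negative gradient with a fixed step length h.
            w = np.array(w_start, dtype=float)
            for _ in range(n_iter):
                w = w - h * grad(w)        # gradient ascent: w = w + h * grad(w)
            return w

        # Example cost e(w) = (w1 - 3)^2 + 2*(w2 + 1)^2, gradient [2(w1-3), 4(w2+1)]
        grad = lambda w: np.array([2 * (w[0] - 3), 4 * (w[1] + 1)])
        print(gradient_descent(grad, w_start=[0.0, 0.0], h=0.1))   # approaches [3, -1]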

  • Slide 15/34

    Choosing the step length

    The size of the gradient is not relevant!

    [Figure: a cost function ε(w). With a fixed step length, a small derivative
    gives small steps and a large derivative gives large steps.]

  • Slide 16/34

    Local optima

    Gradient search is not guaranteed to find the

    global minimum/maximum.

    With a sufficiently small step length, the

    closest local optimum will be found.

    [Figure: a cost function with several local minima and one global minimum.]

  • Slide 17/34

    Online learning vs. batch learning

    Batch learning: Use all training examples to
    update the classifier.

    Most often used.

    Online learning: Update the classifier using
    one training example at a time.

    Can be used when training samples arrive
    sequentially, e.g., adaptive learning.

    Also known as stochastic ascent/descent (see the sketch below).
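    To make the distinction concrete, here is a small sketch (not from the
    slides) contrasting the two update schemes; the per-sample squared-error
    gradient is just a placeholder for whatever cost is being optimized.

        import numpy as np

        def grad_i(w, x, y):
            # Placeholder per-sample gradient: d/dw of (w^T x - y)^2 for one sample.
            return 2 * (w @ x - y) * x

        def batch_update(w, X, Y, h):
            # Batch learning: accumulate the gradient over ALL samples, then step once.
            g = sum(grad_i(w, x, y) for x, y in zip(X, Y))
            return w - h * g

        def online_update(w, X, Y, h):
            # Online (stochastic) learning: take one step per training sample.
            for x, y in zip(X, Y):
                w = w - h * grad_i(w, x, y)
            return w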

  • Slide 18/34

    Three different cost functions

    Perceptron algorithm

    Linear Discriminant Analysis (a.k.a. Fisher Linear Discriminant)

    Support Vector Machines


  • Slide 19/34

    Perceptron algorithm

    Maximize the following cost function (linear activation function):

    \varepsilon(w_0, w_1, \dots, w_n) = \sum_{i \in I} y_i \, \mathbf{w}^T \mathbf{x}_i

    n = number of features
    I = set of misclassified training samples
    y_i ∈ {-1, 1} depending on the class of training sample i

    The cost considers only misclassified samples, and y_i w^T x_i is negative
    for a misclassified sample, so ε(w_0, w_1, ..., w_n) is always negative,
    or 0 for a perfect classification!

    Historically (1960s) one of the first machine learning algorithms.

  • Slide 20/34

    Perceptron algorithm, cont.

    \varepsilon(w_0, \dots, w_n) = \sum_{i \in I} y_i \, \mathbf{w}^T \mathbf{x}_i
    \qquad\Rightarrow\qquad
    \frac{\partial \varepsilon}{\partial w_k} = \sum_{i \in I} y_i x_{ik}
    \qquad\Rightarrow\qquad
    \frac{\partial \varepsilon}{\partial \mathbf{w}} = \sum_{i \in I} y_i \mathbf{x}_i

    Gradient ascent (Eq. 1):

    \mathbf{w}_{t+1} = \mathbf{w}_t + \eta \, \frac{\partial \varepsilon}{\partial \mathbf{w}} = \mathbf{w}_t + \eta \sum_{i \in I} y_i \mathbf{x}_i

    Algorithm:

    1. Start with a random w
    2. Iterate Eq. 1 until convergence
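    A minimal NumPy sketch of this training loop (my own code, assuming the
    bias-augmented feature vectors from slide 6 and labels y_i in {-1, +1}):

        import numpy as np

        def train_perceptron(X, Y, eta=0.1, n_iter=100):
            # X: (N, n+1) matrix of bias-augmented feature vectors, Y: (N,) labels in {-1, +1}
            w = np.random.randn(X.shape[1])            # 1. start with a random w
            for _ in range(n_iter):
                I = np.sign(X @ w) != Y                # I = set of misclassified samples
                if not I.any():
                    break                              # all samples correctly classified
                w = w + eta * (Y[I, None] * X[I]).sum(axis=0)   # 2. iterate Eq. 1
            return w

    Replacing the sum over all of I with a single misclassified sample per step
    gives the online (stochastic) variant mentioned on slide 17.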

  • Slide 21/34

    Perceptron example


  • Slide 22/34

    Perceptron algorithm, cont.

    If the classes are linearly separable, the

    perceptron algorithm will converge to a

    solution that separates the classes.

    If the classes are not linearly separable, the

    algorithm will fail.

    Which solution it arrives at depends on the
    starting value of w.

  • Slide 23/34

    Linear Discriminant Analysis (LDA)

    Consider the projected feature vectors.

    [Figure: two classes in the (x_1, x_2) plane, projected onto the direction
    w = (w_1, w_2)^T.]

  • Slide 24/34

    LDA: Different projections

    [Figure: two different projection directions for the same data; the optimal
    projection and the corresponding separating plane.]

  • Slide 25/34

    LDA: Separation

    [Figure: a good projection gives small variance within each projected class
    and a large distance between the projected class means; a poor projection
    gives large variance and a small distance.]

    Goal: minimize variance and maximize distance.

    Maximize:

    \varepsilon = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}

  • Slide 26/34

    LDA: Cost function

    \varepsilon = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2},
    \qquad
    \sigma_1^2 + \sigma_2^2 = \mathbf{w}^T\mathbf{C}_1\mathbf{w} + \mathbf{w}^T\mathbf{C}_2\mathbf{w} = \mathbf{w}^T\mathbf{C}_{\mathrm{tot}}\mathbf{w}

    Projected mean:

    \mu = \frac{1}{n}\sum_{i=1}^{n} \mathbf{w}^T\mathbf{x}_i = \mathbf{w}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) = \mathbf{w}^T\bar{\mathbf{x}}

    Variance:

    \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\bar{\mathbf{x}}\right)^2 = \mathbf{w}^T\left[\frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{C}\mathbf{w}

    Distance:

    (\mu_1-\mu_2)^2 = \left(\mathbf{w}^T\bar{\mathbf{x}}_1 - \mathbf{w}^T\bar{\mathbf{x}}_2\right)^2 = \mathbf{w}^T(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)^T\mathbf{w} = \mathbf{w}^T\mathbf{M}\mathbf{w}

  • Slide 27/34

    LDA: Cost function, cont.

    \varepsilon = \frac{(\mu_1-\mu_2)^2}{\sigma_1^2+\sigma_2^2} = \frac{\mathbf{w}^T\mathbf{M}\mathbf{w}}{\mathbf{w}^T\mathbf{C}_{\mathrm{tot}}\mathbf{w}}

    This form is called a (generalized) Rayleigh quotient, which is maximized by
    the eigenvector with the largest eigenvalue of the generalized eigenvalue
    problem \mathbf{M}\mathbf{w} = \lambda\mathbf{C}_{\mathrm{tot}}\mathbf{w}.

    Simplification:

    \mathbf{M}\mathbf{w} = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)^T\mathbf{w} = K(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2) \quad \text{for some scalar } K

    \Rightarrow \quad \mathbf{w} \sim \mathbf{C}_{\mathrm{tot}}^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)

    Scaling of w is not important!
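    As an illustration only, a NumPy sketch of this closed-form solution; the
    names are mine, and np.cov normalizes by 1/(N-1) rather than the 1/n used
    above, which only rescales w. The classification threshold below (the
    midpoint of the projected class means) is one simple, hypothetical choice;
    the slides do not specify it.

        import numpy as np

        def lda_direction(X1, X2):
            # X1, X2: samples of class 1 and class 2, one feature vector per row.
            m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
            C_tot = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
            w = np.linalg.solve(C_tot, m1 - m2)     # w ~ C_tot^{-1} (mean1 - mean2)
            return w, m1, m2

        def lda_classify(x, w, m1, m2):
            # Threshold the super feature w^T x at the midpoint of the projected means.
            return 1 if w @ x > w @ ((m1 + m2) / 2) else -1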

  • Slide 28/34

    LDA - Examples

    [Figure: two 2D example data sets with LDA applied.]

  • Slide 29/34

    Support Vector Machines (SVM): Idea!

    The optimal separation line remains the same; the feature points
    close to the class limits are more important!

    These are called support vectors!

    [Figure: a two-class data set with the support vectors highlighted.]

  • Slide 30/34

    SVM: Maximum margin

    Choose the w that gives maximum margin!

    [Figure: the separating line with a margin of width 2/\|\mathbf{w}\| and the
    weight vector w.]

  • Slide 31/34

    SVM: Cost function

    [Figure: the separating line w^T x + w_0 = 0 with margin boundaries
    w^T x + w_0 = 1 and w^T x + w_0 = -1; the regions outside the margin satisfy
    w^T x + w_0 > 1 and w^T x + w_0 < -1.]

    Scaling of w is free: choose a specific scaling!

    Let x_p be a point on the separating line (w^T x_p + w_0 = 0) and let
    x_s = x_p + \boldsymbol{\delta} be a support vector on the margin boundary,
    with \boldsymbol{\delta} parallel to w:

    \mathbf{w}^T\mathbf{x}_s + w_0 = \mathbf{w}^T\boldsymbol{\delta} + \mathbf{w}^T\mathbf{x}_p + w_0 = \mathbf{w}^T\boldsymbol{\delta} = \|\boldsymbol{\delta}\|\,\|\mathbf{w}\| = 1

    For the support vector, \|\boldsymbol{\delta}\| = 1/\|\mathbf{w}\|.

  • Slide 32/34

    SVM: Cost function, cont.

    Maximizing \delta = 1/\|\mathbf{w}\| is the same as minimizing \|\mathbf{w}\|!

    \min_{\mathbf{w}} \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + w_0) \geq 1 \ \ \forall i

    No training sample may reside in the margin region!

    The optimization procedure is outside the scope of this course.
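    Not part of the slides: in practice a linear SVM is usually fitted with an
    existing package, for example scikit-learn (assuming it is installed). The
    toy data below is made up, and the large C value only approximates the
    hard-margin formulation above.

        import numpy as np
        from sklearn.svm import SVC

        X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
        y = np.array([1, 1, -1, -1])

        clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C: (almost) no margin violations

        print(clf.coef_, clf.intercept_)               # w and w0 of the separating line
        print(clf.support_vectors_)                    # the support vectors
        print(clf.predict([[0.5, 1.0]]))               # classify a new sample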

  • Slide 33/34

    Summary: Linear classifiers

    Perceptron algorithm:

    Of historical interest

    Linear Discriminant Analysis

    Simple to implement

    Very useful as a first classifier to try

    Support Vector Machines

    By many considered the state-of-the-art classification
    principle, especially the nonlinear forms (see next lecture)

    Many software packages exist on the internet


  • Slide 34/34

    What about more than 2 classes?

    Common solution: Combine several binary

    classifiers

    [Figure: several binary classifiers applied to the same input; evaluate the
    most likely class.]
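    A small sketch (my own, not from the slides) of one common way to combine
    binary classifiers, one-vs-rest: train one linear classifier per class and
    pick the class with the largest discriminant value. The helper train_binary
    stands for any of the binary trainers above and is hypothetical.

        import numpy as np

        def one_vs_rest_train(X, Y, classes, train_binary):
            # Train one binary classifier per class: class k (+1) vs. all others (-1).
            return {k: train_binary(X, np.where(Y == k, 1, -1)) for k in classes}

        def one_vs_rest_predict(x, W):
            # Evaluate the most likely class: the largest discriminant w_k^T x wins.
            return max(W, key=lambda k: W[k] @ x)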