best regress 11


TRANSCRIPT


Oct 29th, 2001. Copyright 2001, Andrew W. Moore

Eight more Classic Machine Learning algorithms

    Andrew W. Moore

    Associate Professor

    School of Computer Science

Carnegie Mellon University, www.cs.cmu.edu/~awm

    [email protected]

    412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 2

8: Polynomial Regression

So far we've mainly been dealing with linear regression:

X1  X2  Y
3   2   7
1   1   3
:   :   :

X = [3 2; 1 1; ...]   y = [7; 3; ...]   x1 = (3,2), y1 = 7, ...

Z = [1 3 2; 1 1 1; ...]   y = [7; 3; ...]   z1 = (1,3,2), ..., zk = (1, xk1, xk2), y1 = 7, ...

β = (Z^T Z)^-1 (Z^T y)

y_est = β0 + β1 x1 + β2 x2
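A minimal sketch (not from the slides) of the fit above: prepend a constant-1 column to X to get Z, then solve for β. The first two rows are the slide's toy values (3,2)→7 and (1,1)→3; the extra rows are made up so the system is overdetermined. lstsq returns the same solution as (Z^T Z)^-1 (Z^T y), just more stably.

```python
import numpy as np

X = np.array([[3.0, 2.0],
              [1.0, 1.0],
              [2.0, 5.0],    # assumed extra data, not from the slide
              [4.0, 1.0]])
y = np.array([7.0, 3.0, 12.0, 6.0])   # targets; last two assumed

Z = np.hstack([np.ones((X.shape[0], 1)), X])   # z_k = (1, x_k1, x_k2)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # beta = (Z^T Z)^-1 Z^T y

def y_est(x1, x2):
    """y_est = beta0 + beta1*x1 + beta2*x2"""
    return beta[0] + beta[1] * x1 + beta[2] * x2

print(beta, y_est(3.0, 2.0))
```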


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 3

Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions:

X1  X2  Y
3   2   7
1   1   3
:   :   :

X = [3 2; 1 1; ...]   y = [7; 3; ...]   x1 = (3,2), y1 = 7, ...

z = (1, x1, x2, x1^2, x1 x2, x2^2)

Z = [1 3 2 9 6 4; 1 1 1 1 1 1; ...]   y = [7; 3; ...]

β = (Z^T Z)^-1 (Z^T y)

y_est = β0 + β1 x1 + β2 x2 + β3 x1^2 + β4 x1 x2 + β5 x2^2

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 4

Quadratic Regression

It's trivial to do linear fits of fixed nonlinear basis functions:

X1  X2  Y
3   2   7
1   1   3
:   :   :

X = [3 2; 1 1; ...]   y = [7; 3; ...]   x1 = (3,2), y1 = 7, ...

z = (1, x1, x2, x1^2, x1 x2, x2^2)

Z = [1 3 2 9 6 4; 1 1 1 1 1 1; ...]   y = [7; 3; ...]

β = (Z^T Z)^-1 (Z^T y)

y_est = β0 + β1 x1 + β2 x2 + β3 x1^2 + β4 x1 x2 + β5 x2^2

Each component of a z vector is called a term.

Each column of the Z matrix is called a term column.

How many terms in a quadratic regression with m inputs?

1 constant term

m linear terms

(m+1)-choose-2 = m(m+1)/2 quadratic terms

(m+2)-choose-2 terms in total = O(m^2)

Note that solving β = (Z^T Z)^-1 (Z^T y) is thus O(m^6)
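A short sketch (helper names are mine, not from the slides) of building the quadratic term columns z = (1, x1, ..., xm, all products xi·xj with i ≤ j) and checking that there are (m+2)-choose-2 of them:

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def quadratic_terms(X):
    """Return Z whose rows are (1, x1, ..., xm, x1^2, x1*x2, ..., xm^2)."""
    R, m = X.shape
    cols = [np.ones(R)]                                  # 1 constant term
    cols += [X[:, i] for i in range(m)]                  # m linear terms
    cols += [X[:, i] * X[:, j]                           # m(m+1)/2 quadratic terms
             for i, j in combinations_with_replacement(range(m), 2)]
    return np.column_stack(cols)

X = np.array([[3.0, 2.0], [1.0, 1.0]])
Z = quadratic_terms(X)
m = X.shape[1]
assert Z.shape[1] == comb(m + 2, 2)    # (m+2)-choose-2 terms in total
print(Z)                               # first row: [1. 3. 2. 9. 6. 4.]
```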


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 5

Qth-degree polynomial Regression

X1  X2  Y
3   2   7
1   1   3
:   :   :

X = [3 2; 1 1; ...]   y = [7; 3; ...]   x1 = (3,2), y1 = 7, ...

z = (all products of powers of inputs in which the sum of the powers is Q or less)

Z = [1 3 2 9 6 4 ...; 1 1 1 1 1 1 ...; ...]   y = [7; 3; ...]

β = (Z^T Z)^-1 (Z^T y)

y_est = β0 + β1 x1 + ...

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 6

m inputs, degree Q: how many terms?

= the number of unique terms of the form x1^q1 x2^q2 ... xm^qm where Σ_{i=1}^m qi ≤ Q

= the number of unique terms of the form 1^q0 x1^q1 x2^q2 ... xm^qm where Σ_{i=0}^m qi = Q

= the number of lists of non-negative integers [q0, q1, q2, ..., qm] in which Σ qi = Q

= the number of ways of placing Q red disks on a row of squares of length Q+m

= (Q+m)-choose-Q

[Figure: disk-placement example with Q=11, m=4: q0=2, q1=2, q2=0, q3=4, q4=3]
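A quick sanity check (mine, not from the slides) of the (Q+m)-choose-Q count, by brute-force enumeration of exponent lists for small m and Q:

```python
from itertools import product
from math import comb

def count_terms(m, Q):
    """Count monomials x1^q1 ... xm^qm whose total degree is Q or less."""
    return sum(1 for qs in product(range(Q + 1), repeat=m) if sum(qs) <= Q)

for m, Q in [(2, 2), (4, 3), (3, 5)]:
    assert count_terms(m, Q) == comb(Q + m, Q)

print(count_terms(4, 11), comb(11 + 4, 11))   # m=4, Q=11 as in the slide's example
```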


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 7

    7: Radial Basis Functions (RBFs)

X1  X2  Y
3   2   7
1   1   3
:   :   :

X = [3 2; 1 1; ...]   y = [7; 3; ...]   x1 = (3,2), y1 = 7, ...

z = (list of radial basis function evaluations)

Z = [...]   y = [7; 3; ...]

β = (Z^T Z)^-1 (Z^T y)

y_est = β0 + β1 x1 + ...

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 8

    1-d RBFs

y_est = β1 Φ1(x) + β2 Φ2(x) + β3 Φ3(x)

where Φi(x) = KernelFunction( |x - ci| / KW )

[Figure: y vs. x, with three kernel bumps centered at c1, c2, c3]


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 9

    Example

y_est = 2 Φ1(x) + 0.05 Φ2(x) + 0.5 Φ3(x)

where Φi(x) = KernelFunction( |x - ci| / KW )

[Figure: y vs. x, with the three weighted kernel bumps centered at c1, c2, c3]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 10

    RBFs with Linear Regression

y_est = 2 Φ1(x) + 0.05 Φ2(x) + 0.5 Φ3(x)

where Φi(x) = KernelFunction( |x - ci| / KW )

[Figure: y vs. x, with kernel bumps centered at c1, c2, c3]

All ci's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 11

    RBFs with Linear Regression

y_est = 2 Φ1(x) + 0.05 Φ2(x) + 0.5 Φ3(x)

where Φi(x) = KernelFunction( |x - ci| / KW )

Then given Q basis functions, define the matrix Z such that Zkj = KernelFunction( |xk - cj| / KW ), where xk is the kth vector of inputs.

And as before, β = (Z^T Z)^-1 (Z^T y)

[Figure: y vs. x, with kernel bumps centered at c1, c2, c3]

All ci's are held constant (initialized randomly or on a grid in m-dimensional input space).

KW is also held constant (initialized to be large enough that there's decent overlap between basis functions*).

*Usually much better than the crappy overlap on my diagram
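A sketch of RBFs with linear regression (fixed centers, fixed width). The Gaussian bump, the grid of centers and the toy data are my assumptions; the slide only specifies Zkj = KernelFunction(|xk - cj| / KW).

```python
import numpy as np

def kernel(r):
    return np.exp(-0.5 * r**2)                      # assumed kernel shape

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 40))                 # 1-d inputs x_k
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)   # toy targets y_k

centers = np.linspace(0, 10, 6)                     # c_j held constant, on a grid
KW = 2.0                                            # kernel width, held constant

Z = kernel(np.abs(x[:, None] - centers[None, :]) / KW)   # Z_kj
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)        # same solution as (Z^T Z)^-1 (Z^T y)

def y_est(xq):
    return kernel(np.abs(xq - centers) / KW) @ beta

print(y_est(2.5), np.sin(2.5))
```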

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 12

    RBFs with NonLinear Regression

y_est = 2 Φ1(x) + 0.05 Φ2(x) + 0.5 Φ3(x)

where Φi(x) = KernelFunction( |x - ci| / KW )

But how do we now find all the βj's, the ci's and KW?

[Figure: y vs. x, with kernel bumps centered at c1, c2, c3]

Allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 13

    RBFs with NonLinear Regression

y_est = 2 Φ1(x) + 0.05 Φ2(x) + 0.5 Φ3(x)

where Φi(x) = KernelFunction( |x - ci| / KW )

But how do we now find all the βj's, the ci's and KW?

[Figure: y vs. x, with kernel bumps centered at c1, c2, c3]

Allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)

Answer: Gradient Descent

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 14

    RBFs with NonLinear Regression

y_est = 2 Φ1(x) + 0.05 Φ2(x) + 0.5 Φ3(x)

where Φi(x) = KernelFunction( |x - ci| / KW )

But how do we now find all the βj's, the ci's and KW?

[Figure: y vs. x, with kernel bumps centered at c1, c2, c3]

Allow the ci's to adapt to the data (initialized randomly or on a grid in m-dimensional input space).

KW is allowed to adapt to the data. (Some folks even let each basis function have its own KWj, permitting fine detail in dense regions of input space.)

Answer: Gradient Descent

(But I'd like to see, or hope someone's already done, a hybrid, where the ci's and KW are updated with gradient descent while the βj's use matrix inversion.)


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 15

Radial Basis Functions in 2-d

[Figure: the x1-x2 input plane, showing a center and the sphere of significant influence of the center]

Two inputs. Outputs (heights sticking out of the page) not shown.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 16

    Happy RBFs in 2-d

[Figure: the x1-x2 input plane; centers and their spheres of significant influence]

Blue dots denote coordinates of input vectors.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 17

    Crabby RBFs in 2-d

[Figure: the x1-x2 input plane; centers and their spheres of significant influence]

Blue dots denote coordinates of input vectors.

What's the problem in this example?

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 18

More crabby RBFs

[Figure: the x1-x2 input plane; centers and their spheres of significant influence]

Blue dots denote coordinates of input vectors.

And what's the problem in this example?


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 19

    Hopeless!

[Figure: the x1-x2 input plane; centers and their spheres of significant influence]

Even before seeing the data, you should understand that this is a disaster!

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 20

    Unhappy

[Figure: the x1-x2 input plane; centers and their spheres of significant influence]

Even before seeing the data, you should understand that this isn't good either.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 21

    6: Robust Regression

[Figure: y vs. x data]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 22

    Robust Regression

[Figure: y vs. x data with a quadratic fit]

This is the best fit that Quadratic Regression can manage.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 23

    Robust Regression

[Figure: y vs. x data with the preferred fit]

...but this is what we'd probably prefer.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 24

    LOESS-based Robust Regression

[Figure: y vs. x data with the initial fit]

After the initial fit, score each datapoint according to how well it's fitted.

"You are a very good datapoint."


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 25

    LOESS-based Robust Regression

[Figure: y vs. x data with the initial fit]

After the initial fit, score each datapoint according to how well it's fitted.

"You are a very good datapoint."

"You are not too shabby."

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 26

    LOESS-based Robust Regression

[Figure: y vs. x data with the initial fit]

After the initial fit, score each datapoint according to how well it's fitted.

"You are a very good datapoint."

"You are not too shabby."

"But you are pathetic."


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 29

Robust Regression

[Figure: y vs. x data]

For k = 1 to R:
  Let (xk, yk) be the kth datapoint.
  Let y_est_k be the predicted value of yk.
  Let wk be a weight for datapoint k that is large if the datapoint fits well and small if it fits badly:
    wk = KernelFn( [yk - y_est_k]^2 )

Then redo the regression using weighted datapoints.

I taught you how to do this in the Instance-based lecture (only then the weights depended on distance in input-space).

Repeat the whole thing until converged!
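A sketch of the reweighting loop above, for a quadratic fit in 1-d. The choice of KernelFn (a Gaussian in the squared residual), the scale, and the fixed number of passes are my assumptions; the slide just says "repeat until converged".

```python
import numpy as np

def robust_quadratic_fit(x, y, passes=10, scale=1.0):
    Z = np.column_stack([np.ones_like(x), x, x**2])   # basis (1, x, x^2)
    w = np.ones_like(y)                               # start with equal weights
    for _ in range(passes):
        sw = np.sqrt(w)                               # weighted least squares:
        beta, *_ = np.linalg.lstsq(Z * sw[:, None], y * sw, rcond=None)
        resid2 = (y - Z @ beta) ** 2
        w = np.exp(-resid2 / (2 * scale**2))          # w_k = KernelFn([y_k - y_est_k]^2)
    return beta

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 1 + 2*x - 0.5*x**2 + 0.2*rng.standard_normal(x.size)
y[::10] += 8                                          # a few gross outliers
print(robust_quadratic_fit(x, y))                     # close to [1, 2, -0.5]
```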

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 30

Robust Regression --- what we're doing

What regular regression does:

Assume yk was originally generated using the following recipe:

yk = β0 + β1 xk + β2 xk^2 + N(0, σ^2)

Computational task is to find the Maximum Likelihood β0, β1 and β2


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 31

Robust Regression --- what we're doing

What LOESS robust regression does:

Assume yk was originally generated using the following recipe:

With probability p:
  yk = β0 + β1 xk + β2 xk^2 + N(0, σ^2)

But otherwise:
  yk ~ N(μ, σ_huge^2)

Computational task is to find the Maximum Likelihood β0, β1, β2, p, and σ_huge

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 32

Robust Regression --- what we're doing

What LOESS robust regression does:

Assume yk was originally generated using the following recipe:

With probability p:
  yk = β0 + β1 xk + β2 xk^2 + N(0, σ^2)

But otherwise:
  yk ~ N(μ, σ_huge^2)

Computational task is to find the Maximum Likelihood β0, β1, β2, p, and σ_huge

Mysteriously, the reweighting procedure does this computation for us.

Your first glimpse of two spectacular letters: E.M.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 33

    5: Regression Trees

    Decision trees for regression

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 34

    A regression tree leaf

    Predict age = 47

Mean age of records matching this leaf node


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 35

    A one-split regression tree

Gender?
  Female → Predict age = 39
  Male → Predict age = 36

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 36

    Choosing the attribute to split on

We can't use information gain.

What should we use?

Gender  Rich?  Num. Children  Num. BeanyBabies  Age
Female  No     2              1                 38
Male    No     0              0                 24
:       :      :              :                 :
Male    Yes    0              5+                72


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 37

    Choosing the attribute to split on

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity ȳ_{x=j}.

Then

Gender  Rich?  Num. Children  Num. BeanyBabies  Age
Female  No     2              1                 38
Male    No     0              0                 24
:       :      :              :                 :
Male    Yes    0              5+                72

MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} ( y_k − ȳ_{x=j} )^2

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 38

Choosing the attribute to split on

MSE(Y|X) = the expected squared error if we must predict a record's Y value given only knowledge of the record's X value.

If we're told x = j, the smallest expected error comes from predicting the mean of the Y-values among those records in which x = j. Call this mean quantity ȳ_{x=j}.

Then

Gender  Rich?  Num. Children  Num. BeanyBabies  Age
Female  No     2              1                 38
Male    No     0              0                 24
:       :      :              :                 :
Male    Yes    0              5+                72

MSE(Y|X) = (1/R) Σ_{j=1}^{N_X} Σ_{k such that x_k = j} ( y_k − ȳ_{x=j} )^2

Regression tree attribute selection: greedily choose the attribute that minimizes MSE(Y|X).

Guess what we do about real-valued inputs?

Guess how we prevent overfitting.
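A sketch (toy data and names are mine) of computing MSE(Y|X) for each candidate categorical attribute and greedily picking the split that minimizes it:

```python
import numpy as np

def mse_given_x(x, y):
    """MSE(Y|X) = (1/R) * sum over values j of sum_{k: x_k=j} (y_k - mean_{x=j})^2."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x)
    total = 0.0
    for j in set(x):
        yj = y[x == j]
        total += np.sum((yj - yj.mean()) ** 2)
    return total / len(y)

records = {
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Rich?":  ["No", "No", "Yes", "Yes", "No"],
}
age = [38, 24, 72, 61, 29]

best = min(records, key=lambda attr: mse_given_x(records[attr], age))
print({a: round(mse_given_x(records[a], age), 1) for a in records}, "->", best)
```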


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 39

    Pruning Decision

property-owner = Yes
  Gender?
    Female → Predict age = 39
    Male → Predict age = 36

# property-owning females = 56712
Mean age among POFs = 39
Age std dev among POFs = 12

# property-owning males = 55800
Mean age among POMs = 36
Age std dev among POMs = 11.5

Use a standard Chi-squared test of the null hypothesis "these two populations have the same mean" and Bob's your uncle.

"Do I deserve to live?"

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 40

    Linear Regression Trees

property-owner = Yes
  Gender? (Female / Male branches)
    one leaf: Predict age = 26 + 6 * NumChildren − 2 * YearsEducation
    other leaf: Predict age = 24 + 7 * NumChildren − 2.5 * YearsEducation

Leaves contain linear functions (trained using linear regression on all records matching that leaf).

Also known as Model Trees.

Split attribute chosen to minimize MSE of regressed children.

Pruning with a different Chi-squared test.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 41

    Linear Regression Trees

property-owner = Yes
  Gender? (Female / Male branches)
    one leaf: Predict age = 26 + 6 * NumChildren − 2 * YearsEducation
    other leaf: Predict age = 24 + 7 * NumChildren − 2.5 * YearsEducation

Leaves contain linear functions (trained using linear regression on all records matching that leaf).

Also known as Model Trees.

Split attribute chosen to minimize MSE of regressed children.

Pruning with a different Chi-squared test.

Detail: You typically ignore any categorical attribute that has been tested on higher up in the tree during the regression. But use all untested attributes, and use real-valued attributes even if they've been tested above.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 42

    Test your understanding

[Figure: y vs. x data]

Assuming regular regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 43

    Test your understanding

[Figure: y vs. x data]

Assuming linear regression trees, can you sketch a graph of the fitted function y_est(x) over this diagram?

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 44

4: GMDH (c.f. BACON, AIM)

Group Method Data Handling. A very simple but very good idea:

1. Do linear regression.

2. Use cross-validation to discover whether any quadratic term is good. If so, add it as a basis function and loop.

3. Use cross-validation to discover whether any of a set of familiar functions (log, exp, sin etc.) applied to any previous basis function helps. If so, add it as a basis function and loop.

4. Else stop


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 45

GMDH (c.f. BACON, AIM)

Group Method Data Handling. A very simple but very good idea:

1. Do linear regression.

2. Use cross-validation to discover whether any quadratic term is good. If so, add it as a basis function and loop.

3. Use cross-validation to discover whether any of a set of familiar functions (log, exp, sin etc.) applied to any previous basis function helps. If so, add it as a basis function and loop.

4. Else stop

Typical learned function:

age_est = height − 3.1 sqrt(weight) + 4.3 income / (cos(NumCars))
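A simplified sketch in the spirit of the loop above (not the full GMDH algorithm): greedily add one new basis function at a time, keeping it only if it improves a cross-validated squared error. The candidate transforms, fold count and stopping rule are my choices.

```python
import numpy as np

def cv_mse(Z, y, folds=5):
    """K-fold cross-validated MSE of a linear fit on basis matrix Z."""
    idx = np.array_split(np.arange(len(y)), folds)
    errs = []
    for te in idx:
        tr = np.setdiff1d(np.arange(len(y)), te)
        beta, *_ = np.linalg.lstsq(Z[tr], y[tr], rcond=None)
        errs.append(np.mean((y[te] - Z[te] @ beta) ** 2))
    return np.mean(errs)

def gmdh_like(X, y, transforms=(np.log1p, np.sin, np.cos, np.sqrt), rounds=5):
    Z = np.hstack([np.ones((len(y), 1)), X])        # step 1: linear regression basis
    best = cv_mse(Z, y)
    for _ in range(rounds):
        # candidates: products of existing columns, and familiar functions of them
        cands = [Z[:, i:i+1] * Z[:, j:j+1]
                 for i in range(Z.shape[1]) for j in range(i, Z.shape[1])]
        cands += [f(np.abs(Z[:, i:i+1])) for f in transforms for i in range(Z.shape[1])]
        scored = [(cv_mse(np.hstack([Z, c]), y), c) for c in cands]
        score, col = min(scored, key=lambda t: t[0])
        if score >= best:                            # step 4: else stop
            break
        Z, best = np.hstack([Z, col]), score         # add it as a basis function and loop
    return Z, best

rng = np.random.default_rng(2)
X = rng.uniform(1, 5, size=(60, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.standard_normal(60)
Z, score = gmdh_like(X, y)
print(Z.shape, round(float(score), 4))
```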

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 46

3: Cascade Correlation

A super-fast form of Neural Network learning. Begins with 0 hidden units.

Incrementally adds hidden units, one by one, doing ingeniously little recomputation between each unit addition.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 47

    Cascade beginning

[Table: columns X^(0), X^(1), ..., X^(m), Y; row k holds x^(0)_k, x^(1)_k, ..., x^(m)_k, y_k]

Begin with a regular dataset.

Nonstandard notation:

X^(i) is the ith attribute.

x^(i)_k is the value of the ith attribute in the kth record.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 48

    Cascade first step

[Table: columns X^(0), X^(1), ..., X^(m), Y]

Begin with a regular dataset.

Find weights w^(0)_i to best fit Y, i.e. to minimize

Σ_{k=1}^R ( y_k − y^(0)_k )^2   where   y^(0)_k = Σ_{j=1}^m w^(0)_j x^(j)_k


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 49

    Consider our errors

[Table: columns X^(0), ..., X^(m), Y^(0), E^(0), Y; row k holds x^(0)_k, ..., x^(m)_k, y^(0)_k, e^(0)_k, y_k]

Begin with a regular dataset. Find weights w^(0)_i to best fit Y, i.e. to minimize

Σ_{k=1}^R ( y_k − y^(0)_k )^2   where   y^(0)_k = Σ_{j=1}^m w^(0)_j x^(j)_k

Define e^(0)_k = y_k − y^(0)_k

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 50

    Create a hidden unit

[Table: columns X^(0), ..., X^(m), H^(0), Y^(0), E^(0), Y]

Find weights u^(0)_i to define a new basis function H^(0)(x) of the inputs.

Make it specialize in predicting the errors in our original fit:

Find {u^(0)_i} to maximize correlation between H^(0)(x) and E^(0), where

H^(0)(x) = g( Σ_{j=1}^m u^(0)_j x^(j) )


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 51

    Cascade next step

Find weights w^(1)_i and p^(1)_0 to better fit Y, i.e. to minimize

Σ_{k=1}^R ( y_k − y^(1)_k )^2   where   y^(1)_k = Σ_{j=1}^m w^(1)_j x^(j)_k + p^(1)_0 h^(0)_k

[Table: columns X^(0), ..., X^(m), H^(0), Y^(0), E^(0), Y^(1), Y]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 52

Now look at new errors

Find weights w^(1)_i and p^(1)_0 to better fit Y.

[Table: columns X^(0), ..., X^(m), H^(0), Y^(0), E^(0), Y^(1), E^(1), Y]

Define e^(1)_k = y_k − y^(1)_k


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 53

    Create next hidden unit

Find weights u^(1)_i and v^(1)_0 to define a new basis function H^(1)(x) of the inputs.

Make it specialize in predicting the errors in our original fit:

Find {u^(1)_i, v^(1)_0} to maximize correlation between H^(1)(x) and E^(1), where

H^(1)(x) = g( Σ_{j=1}^m u^(1)_j x^(j) + v^(1)_0 h^(0) )

[Table: columns X^(0), ..., X^(m), H^(0), H^(1), Y^(0), E^(0), Y^(1), E^(1), Y]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 54

Cascade nth step

Find weights w^(n)_i and p^(n)_j to better fit Y, i.e. to minimize

Σ_{k=1}^R ( y_k − y^(n)_k )^2   where   y^(n)_k = Σ_{j=1}^m w^(n)_j x^(j)_k + Σ_{j=0}^{n−1} p^(n)_j h^(j)_k

[Table: columns X^(0), ..., X^(m), H^(0), ..., H^(n−1), Y^(0), E^(0), ..., Y^(n), Y]


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 55

    Now look at new errors

Find weights w^(n)_i and p^(n)_j to better fit Y, i.e. to minimize the same expression as on the previous slide.

Define e^(n)_k = y_k − y^(n)_k

[Table: columns X^(0), ..., X^(m), H^(0), ..., Y^(n), E^(n), Y]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 56

Create nth hidden unit

Find weights u^(n)_i and v^(n)_j to define a new basis function H^(n)(x) of the inputs.

Make it specialize in predicting the errors in our previous fit:

Find {u^(n)_i, v^(n)_j} to maximize correlation between H^(n)(x) and E^(n), where

H^(n)(x) = g( Σ_{j=1}^m u^(n)_j x^(j) + Σ_{j=0}^{n−1} v^(n)_j h^(j) )

[Table: columns X^(0), ..., X^(m), H^(0), ..., H^(n), Y^(0), E^(0), ..., Y^(n), E^(n), Y]

Continue until satisfied with fit
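A compact sketch in the spirit of the cascade above (g = tanh, plain gradient ascent on the correlation objective, full least-squares refits of the output weights). This is my simplification, not Fahlman & Lebiere's exact procedure, and the gradient ignores the mean-centering term.

```python
import numpy as np

def fit_output(Z, y):
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

def train_hidden_unit(Z, err, steps=200, lr=0.1):
    """Find u to (approximately) maximize corr(tanh(Z u), err) by gradient ascent."""
    rng = np.random.default_rng(0)
    u = 0.1 * rng.standard_normal(Z.shape[1])
    e = err - err.mean()
    for _ in range(steps):
        h = np.tanh(Z @ u)
        hc = h - h.mean()
        denom = np.sqrt(hc @ hc) * np.sqrt(e @ e) + 1e-12
        # d corr / du, with the h.mean() dependence dropped for simplicity
        grad = Z.T @ ((1 - h**2) *
                      (e / denom - (hc @ e) * hc / (denom * (hc @ hc + 1e-12))))
        u += lr * grad
    return u

def cascade_regression(X, y, n_hidden=3):
    Z = np.hstack([np.ones((len(y), 1)), X])          # inputs (plus bias) feed every unit
    beta = fit_output(Z, y)
    for _ in range(n_hidden):
        err = y - Z @ beta                            # E^(n): current residual
        u = train_hidden_unit(Z, err)                 # new unit correlates with the errors
        Z = np.hstack([Z, np.tanh(Z @ u)[:, None]])   # its output becomes a new column
        beta = fit_output(Z, y)                       # refit all output weights
    return Z, beta

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(80, 2))
y = np.sin(X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(80)
Z, beta = cascade_regression(X, y)
print(np.mean((y - Z @ beta) ** 2))
```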


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 57

    Visualizing first iteration

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 58

    Visualizing second iteration


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 59

    Visualizing third iteration

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 60

Example: Cascade Correlation for Classification


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 63

    If you like Cascade Correlation

See also Projection Pursuit

In which you add together many non-linear non-parametric scalar functions of carefully chosen directions.

Each direction is chosen to maximize error-reduction from the best scalar function.

ADA-Boost

An additive combination of regression trees in which the (n+1)th tree learns the error of the nth tree.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 64

    2: Multilinear Interpolation

[Figure: y vs. x data]

Consider this dataset. Suppose we wanted to create a continuous and piecewise linear fit to the data.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 65

    Multilinear Interpolation

[Figure: y vs. x data with knot points q1, q2, q3, q4, q5 marked on the x-axis]

Create a set of knot points: selected X-coordinates (usually equally spaced) that cover the data.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 66

    Multilinear Interpolation

[Figure: y vs. x data with knots q1 ... q5 and three candidate piecewise-linear functions]

We are going to assume the data was generated by a noisy version of a function that can only bend at the knots. Here are 3 examples (none fits the data well).


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 67

    How to find the best fit?

Idea 1: Simply perform a separate regression in each segment for each part of the curve.

What's the problem with this idea?

[Figure: y vs. x data with knots q1 ... q5]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 68

    How to find the best fit?

Let's look at what goes on in the red segment (between q2 and q3, with knot heights h2 and h3):

y_est(x) = h2 (q3 − x) / w + h3 (x − q2) / w   where   w = q3 − q2

[Figure: y vs. x data with knots q1 ... q5; the segment from q2 to q3 highlighted]


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 69

    How to find the best fit?

In the red segment:

y_est(x) = h2 f2(x) + h3 f3(x)

where f2(x) = 1 − (x − q2)/w,   f3(x) = 1 − (q3 − x)/w

[Figure: the tent function f2(x) over the segment from q2 to q3]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 70

    How to find the best fit?

In the red segment:

y_est(x) = h2 f2(x) + h3 f3(x)

where f2(x) = 1 − (x − q2)/w,   f3(x) = 1 − (q3 − x)/w

[Figure: the tent functions f2(x) and f3(x) over the segment from q2 to q3]


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 71

    How to find the best fit?

In the red segment:

y_est(x) = h2 f2(x) + h3 f3(x)

where f2(x) = 1 − |x − q2| / w,   f3(x) = 1 − |x − q3| / w

[Figure: the tent functions f2(x) and f3(x) over the segment from q2 to q3]

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 72

    How to find the best fit?

In the red segment:

y_est(x) = h2 f2(x) + h3 f3(x)

where f2(x) = 1 − |x − q2| / w,   f3(x) = 1 − |x − q3| / w

[Figure: the tent functions f2(x) and f3(x) over the segment from q2 to q3]


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 73

    How to find the best fit?

In general:

y_est(x) = Σ_{i=1}^{N_K} h_i f_i(x)

(where N_K is the number of knots, h_i is the height at knot i, and f_i is knot i's tent function)

[Figure: y vs. x data with knots q1 ... q5 and heights h2, h3]
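A sketch of the 1-d case: tent basis functions f_i at equally spaced knots, and a least-squares fit for the knot heights h_i. The knot placement and toy data are mine.

```python
import numpy as np

def tent_basis(x, knots):
    """F[k, i] = f_i(x_k): 1 at knot i, falling linearly to 0 at the neighbouring knots."""
    w = knots[1] - knots[0]                       # knots assumed equally spaced
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / w, 0.0, None)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

knots = np.linspace(0, 10, 6)                     # q_1 ... q_6
F = tent_basis(x, knots)
h, *_ = np.linalg.lstsq(F, y, rcond=None)         # knot heights minimizing squared error

def y_est(xq):
    return tent_basis(np.atleast_1d(xq), knots) @ h

print(y_est(3.7), np.sin(3.7))
```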


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 75

    In two dimensions

[Figure: the x1-x2 input plane]

Blue dots show locations of input vectors (outputs not depicted).

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 76

    In two dimensions

[Figure: the x1-x2 input plane with a grid of knot points]

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 77

    In two dimensions

[Figure: the x1-x2 input plane; one grid cell whose four corner knots have heights 7, 8, 9 and 3]

Blue dots show locations of input vectors (outputs not depicted).

Each purple dot is a knot point. It will contain the height of the estimated surface.

But how do we do the interpolation to ensure that the surface is continuous?

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 78

    In two dimensions

[Figure: the grid cell with corner heights 7, 8, 9 and 3, and a query point inside it]

But how do we do the interpolation to ensure that the surface is continuous?

To predict the value here...


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 79

    In two dimensions

[Figure: the grid cell with corner heights 7, 8, 9 and 3; the query point inside it]

To predict the value here, first interpolate its value on two opposite edges of the cell (giving 7.33 on one edge and 7 on the other).

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 80

    In two dimensions

[Figure: the grid cell with corner heights 7, 8, 9 and 3; the query point inside it]

To predict the value here, first interpolate its value on two opposite edges of the cell (7.33 and 7), then interpolate between those two values (giving 7.05).


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 81

    In two dimensions

[Figure: the grid cell with corner heights 7, 8, 9 and 3; the query point inside it]

To predict the value here, first interpolate its value on two opposite edges (7.33 and 7), then interpolate between those two values (7.05).

Notes:

This can easily be generalized to m dimensions.

It should be easy to see that it ensures continuity.

The patches are not linear.
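A sketch of the two-edge interpolation above (bilinear interpolation within one grid cell). Which corner carries which of the heights 7, 8, 9, 3 is not recoverable from the slide, so the corner layout below is an assumption.

```python
def bilinear(tx, ty, h00, h10, h01, h11):
    """Interpolate at fractional position (tx, ty) inside a cell.

    h00 = height at (0,0), h10 at (1,0), h01 at (0,1), h11 at (1,1) -- assumed layout.
    """
    bottom = (1 - tx) * h00 + tx * h10      # interpolate along the bottom edge
    top = (1 - tx) * h01 + tx * h11         # interpolate along the top edge
    return (1 - ty) * bottom + ty * top     # then interpolate between the two edges

# Continuity across cells follows because the value on a shared edge depends only
# on that edge's two knots. The patch itself is not linear: it has a tx*ty cross-term.
print(bilinear(0.5, 0.5, 7, 8, 9, 3))       # centre of the cell -> 6.75
```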

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 82

    Doing the regression

[Figure: the x1-x2 knot grid, with corner heights 7, 8, 9 and 3 shown in one cell]

Given data, how do we find the optimal knot heights?

Happily, it's simply a two-dimensional basis function problem. (Working out the basis functions is tedious, unilluminating, and easy.)

What's the problem in higher dimensions?


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 83

    1: MARS

Multivariate Adaptive Regression Splines. Invented by Jerry Friedman (one of Andrew's heroes).

Simplest version:

Let's assume the function we are learning is of the following form:

y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 84

    MARS

y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

[Figure: y vs. x with knots q1 ... q5 and a piecewise-linear curve that bends only at the knots]

Idea: each g_k is one of these.


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 85

MARS

y_est(x) = Σ_{k=1}^m g_k(x_k)

Instead of a linear combination of the inputs, it's a linear combination of non-linear functions of individual inputs.

[Figure: y vs. x with knots q1 ... q5]

y_est(x) = Σ_{k=1}^m Σ_{j=1}^{N_K} h^k_j f^k_j(x_k)
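A sketch of the "simplest version" above: one set of tent basis functions per input, all stacked into a single design matrix, with the heights h^k_j fitted jointly by least squares. This is not Friedman's full adaptive MARS procedure; the knots and toy data are mine.

```python
import numpy as np

def tent_basis(v, knots):
    w = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(v[:, None] - knots[None, :]) / w, 0.0, None)

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 3))                       # m = 3 inputs
y = (np.sin(X[:, 0]) + 0.3 * X[:, 1] + (X[:, 2] - 5) ** 2 / 10
     + 0.1 * rng.standard_normal(200))

knots = np.linspace(0, 10, 6)                               # same knots per input, N_K = 6
F = np.hstack([tent_basis(X[:, k], knots) for k in range(X.shape[1])])
h, *_ = np.linalg.lstsq(F, y, rcond=None)                   # stacked heights h_j^k

def y_est(x):
    f = np.hstack([tent_basis(np.atleast_1d(x[k]), knots) for k in range(len(x))])
    return (f @ h).item()

print(y_est(np.array([2.0, 4.0, 7.0])))
```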


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 87

    If you like MARS

See also CMAC (Cerebellar Model Articulated Controller) by James Albus (another of Andrew's heroes).

Many of the same gut-level intuitions.

But entirely in a neural-network, biologically plausible way.

(All the low dimensional functions are by means of lookup tables, trained with a delta-rule and using a clever blurred update and hash-tables.)

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 88

    Where are we now?

Inputs → Classifier → Predict category: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation

Inputs → Density Estimator → Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning

Inputs → Regressor → Predict real no.: Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

Inputs → Inference Engine → Learn P(E1|E2): Joint DE, Bayes Net Structure Learning


    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 89

    What You Should Know

For each of the eight methods you should be able to summarize briefly what they do and outline how they work.

You should understand them well enough that, given access to the notes, you can quickly re-understand them at a moment's notice.

But you don't have to memorize all the details.

In the right context any one of these eight might end up being really useful to you one day! You should be able to recognize this when it occurs.

    Copyright 2001, Andrew W. Moore Machine Learning Favorites: Slide 90

Citations

Radial Basis Functions
T. Poggio and F. Girosi, Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks, Science, 247, 978-982, 1989.

LOESS
W. S. Cleveland, Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74, 368, 829-836, December 1979.

GMDH etc.
http://www.inf.kiev.ua/GMDH-home/
P. Langley, G. L. Bradshaw and H. A. Simon, Rediscovering Chemistry with the BACON System, in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell and T. M. Mitchell (eds.), Morgan Kaufmann, 1983.

Regression Trees etc.
L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
J. R. Quinlan, Combining Instance-Based and Model-Based Learning, Machine Learning: Proceedings of the Tenth International Conference, 1993.

Cascade Correlation etc.
S. E. Fahlman and C. Lebiere, The Cascade-Correlation Learning Architecture, Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1990. http://citeseer.nj.nec.com/fahlman91cascadecorrelation.html
J. H. Friedman and W. Stuetzle, Projection Pursuit Regression, Journal of the American Statistical Association, 76, 376, December 1981.

MARS
J. H. Friedman, Multivariate Adaptive Regression Splines, Department of Statistics, Stanford University, Technical Report No. 102, 1988.