Predictive Learning from Data
Electrical and Computer Engineering
LECTURE SET 7
Methods for Regression
OUTLINE of Set 7
• Objectives
  - introduce a taxonomy of methods for regression
  - describe several representative nonlinear methods
  - empirical comparisons illustrating advantages and limitations of these methods
• Methods taxonomy
• Linear methods
• Adaptive dictionary methods
• Kernel methods and local risk minimization
• Empirical comparisons
• Combining methods
• Summary and discussion
Motivation and issues
• Importance of regression for implementation of
  - classification
  - density estimation
• Estimation of a real-valued function when data (x, y) is generated as $y = g(\mathbf{x}) + \text{noise}$
• Major issues for regression:
  - parameterization (representation) of $f(\mathbf{x}, w)$
  - optimization formulation (~ empirical loss)
  - complexity control (model selection)
• These issues are inter-related
Loss function and noise model
• Fundamental problem: how to distinguish between true signal and noise in data $y = g(\mathbf{x}) + \text{noise}$?
• Classical statistical view: if the noise density p(noise) is known, the statistically optimal loss function in the maximum likelihood sense is
$L(y, f(\mathbf{x}, \omega)) = -\log p(y - f(\mathbf{x}, \omega))$
→ for Gaussian noise, use squared loss (MSE) as the empirical loss function
Loss functions for linear regression
• Consider linear regression only: $f(\mathbf{x}, \mathbf{w}) = w_0 + \mathbf{w} \cdot \mathbf{x}$
• Several unimodal noise models: Gaussian, Laplacian, other unimodal
• Statistical view:
  - optimal loss for known noise density
  - asymptotic setting
  - robust strategies when the noise model is unknown
• Practical situations:
  - noise model unknown
  - finite (sparse) sample setting
[Figure: (a) linear loss for Laplacian noise; (b) squared loss for Gaussian noise]
ε-insensitive loss (SVM) has a common-sense interpretation: residuals smaller than ε are ignored. The optimal ε depends on the noise level and the sample size.
[Figure: ε-insensitive loss function, zero on the interval (-ε, ε) and growing linearly outside it]
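The three loss functions discussed above can be written down directly; a minimal NumPy sketch (function names are illustrative, and ε = 0.2 is an arbitrary choice for demonstration):

```python
import numpy as np

def squared_loss(r):
    """Squared loss (optimal for Gaussian noise): L(r) = r^2."""
    return r ** 2

def linear_loss(r):
    """Linear (absolute, least-modulus) loss, optimal for Laplacian noise."""
    return np.abs(r)

def eps_insensitive_loss(r, eps=0.2):
    """SVM epsilon-insensitive loss: zero inside the eps-tube."""
    return np.maximum(0.0, np.abs(r) - eps)

# residuals r = y - f(x, w) on a grid, as in the plots above
r = np.linspace(-3, 3, 7)
print(squared_loss(r), linear_loss(r), eps_insensitive_loss(r), sep="\n")
```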
Comparison for high-dimensional data: Gaussian noise vs Laplacian noise
Target function of 20 input variables, $\mathbf{x} \in [0,1]^{20}$, with n = 30 training samples.
[Figure: boxplots of prediction risk for OLS, LM, and SVM under Gaussian noise (left) and Laplacian noise (right)]
Methods' Taxonomy
• Recall implementation of SRM:
  - fix complexity (VC-dimension)
  - minimize empirical risk (squared loss)
• Two interrelated issues:
  - parameterization (of possible models)
  - optimization method
• Taxonomy will be based on:
  - parameterization: dictionary vs kernel
  - flexibility: non-adaptive vs adaptive
• Dictionary representation
$f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
Two possibilities:
• Linear (non-adaptive) methods
  ~ predetermined (fixed) basis functions $g_i(\mathbf{x})$; only parameters $w_i$ have to be estimated, via standard optimization methods (linear least squares)
  Examples: linear regression, polynomial regression, linear classifiers, quadratic classifiers
• Nonlinear (adaptive) methods
  ~ basis functions $g(\mathbf{x}, \mathbf{v}_i)$ depend on the training data
  Possibilities: basis functions nonlinear in parameters $\mathbf{v}_i$ (i.e. MLP); feature selection (i.e. wavelet denoising)
Example: Nonlinear Parameterizations
• Basis functions of the form $g_i(\mathbf{x}) = g(\mathbf{x} \cdot \mathbf{v}_i + b_i) = g(t)$, i.e. the sigmoid aka logistic function
$s(t) = \frac{1}{1 + \exp(-t)}$
  - commonly used in artificial neural networks
  - a combination of sigmoids ~ universal approximator
Neural Network Representation
• MLP or RBF networks implement the dictionary parameterization
$f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$
  - dimensionality reduction
  - universal approximation property; see example at http://www.mathworks.com/products/demos/nnettlbx/radial/index.html
[Network diagram: inputs $x_1, \dots, x_d$ feed hidden units $z_j = g(\mathbf{x}, \mathbf{v}_j)$ (V is d × m); output $\hat{y} = \sum_{j=1}^{m} w_j z_j$ (W is m × 1)]
Kernel Methods
• Model estimated as
$f(\mathbf{x}) = \sum_{i=1}^{n} K_i(\mathbf{x}, \mathbf{x}_i)\, y_i$
where the symmetric kernel function is
  - non-negative: $K(\mathbf{x}, \mathbf{x}') \ge 0$
  - radially symmetric: $K(\mathbf{x}, \mathbf{x}') = K(\|\mathbf{x} - \mathbf{x}'\|)$
  - monotonically decreasing with $t = \|\mathbf{x} - \mathbf{x}'\|$: $\lim_{t \to \infty} K(t) = 0$
• Duality between dictionary and kernel representation:
  dictionary model ~ weighted combination of basis functions
  kernel model ~ weighted combination of output values
• Selection of kernel functions:
  non-adaptive ~ depends only on x-values of training data
  adaptive ~ depends also on y-values of training data
Note: kernel methods may require local complexity control
OUTLINE
• Objectives
• Methods taxonomy
• Linear methods
  - Estimation of linear models
  - Equivalent representations
  - Non-adaptive methods
  - Application example
• Adaptive dictionary methods
• Kernel methods and local risk minimization
• Empirical comparisons
• Combining methods
• Summary and discussion
Estimation of Linear Models
Dictionary representation $f_m(\mathbf{x}, \mathbf{w}) = \sum_{i=0}^{m} w_i g_i(\mathbf{x})$
• Parameters w estimated via least squares
• Denote training data as matrix $X = [\mathbf{x}_1, \dots, \mathbf{x}_n]$ and vector of response values $\mathbf{y} = (y_1, \dots, y_n)$
• OLS solution ~ solving the matrix equation $Z\mathbf{w} = \mathbf{y}$, i.e. minimizing
$R_{emp}(\mathbf{w}) = \frac{1}{n}\|Z\mathbf{w} - \mathbf{y}\|^2$
where
$Z = \begin{bmatrix} g_1(\mathbf{x}_1) & \cdots & g_m(\mathbf{x}_1) \\ \vdots & & \vdots \\ g_1(\mathbf{x}_n) & \cdots & g_m(\mathbf{x}_n) \end{bmatrix} = [g_1(X),\, g_2(X),\, \dots,\, g_m(X)]$
Estimation of Linear Models (cont'd)
• A unique solution of $Z\mathbf{w} = \mathbf{y}$ exists if the columns of Z are linearly independent (m < n)
• Solving the normal equation $Z^T Z \mathbf{w} = Z^T \mathbf{y}$ yields the OLS solution
$\mathbf{w}^* = (Z^T Z)^{-1} Z^T \mathbf{y}$
• Similar math holds for penalized OLS, where
$R_{pen}(\mathbf{w}) = \frac{1}{n}\|Z\mathbf{w} - \mathbf{y}\|^2 + \lambda\, \mathbf{w}^T \mathbf{w}$
with solution $\mathbf{w}^* = (Z^T Z + \lambda I)^{-1} Z^T \mathbf{y}$ (the ridge-regression estimate)
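A minimal NumPy sketch of both solutions above, assuming polynomial basis functions $g_i(x) = x^i$ purely for illustration (in practice a numerically stable solver such as np.linalg.lstsq is preferred over forming $Z^T Z$ explicitly):

```python
import numpy as np

def design_matrix(x, m):
    """Z with columns g_i(x) = x**i, i = 0..m (polynomial dictionary)."""
    return np.vander(x, m + 1, increasing=True)

def ols(Z, y):
    """OLS solution w* = (Z'Z)^{-1} Z'y via the normal equation."""
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

def ridge(Z, y, lam):
    """Penalized OLS solution w* = (Z'Z + lam*I)^{-1} Z'y."""
    m = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
Z = design_matrix(x, 5)
print(ols(Z, y))
print(ridge(Z, y, lam=0.1))
```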
Equivalent Representation
For dictionary representation $f_m(\mathbf{x}, \mathbf{w}) = \sum_{i=0}^{m} w_i g_i(\mathbf{x})$, the OLS solution is
$\hat{\mathbf{y}} = Z\mathbf{w}^* = S\mathbf{y}$, where $S = Z(Z^T Z)^{-1} Z^T$
is the n × n projection matrix (projecting y onto the column space of Z).
Matrix S ~ 'equivalent kernel' of the OLS model $\mathbf{w}^*$:
$S(\mathbf{x}, \mathbf{x}_i) = \mathbf{g}(\mathbf{x})\,(Z^T Z)^{-1}\, \mathbf{g}^T(\mathbf{x}_i)$
Equivalent Representation (cont'd)
• The equivalent kernel $S(\mathbf{x}, \mathbf{x}_i) = \mathbf{g}(\mathbf{x})\,(Z^T Z)^{-1}\, \mathbf{g}^T(\mathbf{x}_i)$ in $\hat{\mathbf{y}} = Z\mathbf{w}^* = S\mathbf{y}$ may not be local
[Figure: equivalent 'kernels' of a 3rd-degree polynomial on [0, 1]]
Equivalent BFs for Symmetric Kernel
• Eigenfunction decomposition of a kernel:
$K(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{\infty} e_i\, g_i(\mathbf{x})\, g_i(\mathbf{x}')$
• The eigenvalues $e_i$ tend to fall off rapidly with i
• Example: the first 4 equivalent basis functions for a Gaussian kernel centered at 0.55 have eigenvalues $e_1 = 1.0$, $e_2 = 0.45$, $e_3 = 0.10$, $e_4 = 0.02$
[Figure: the four equivalent basis functions on [0, 1]]
Equivalent Representation: summary
• Equivalence of representations $\hat{\mathbf{y}} = Z\mathbf{w}^* = S\mathbf{y}$ is due to the duality of the OLS solution
• Equivalent 'kernels' are just math artifacts (and may be non-local). Notational distinction: K vs S
• Practical uses of matrix S:
  - analytic form of LOO cross-validation
  - estimating model complexity for penalized linear estimators (ridge regression)
Estimating Complexity
• A linear estimator is specified via matrix S. Its complexity ~ the number of parameters m of an equivalent linear estimator
• Average variance of the training-data estimates: since $\hat{y}_i = \mathbf{s}_i \mathbf{y}$ (with $\mathbf{s}_i$ the i-th row of S),
$\mathrm{var}(\hat{y}_i) = E(\hat{y}_i - E\hat{y}_i)^2 = \sigma^2\, \mathbf{s}_i \mathbf{s}_i^T$
so the average variance is
$\overline{\mathrm{var}}(\hat{y}) = \frac{\sigma^2}{n}\, \mathrm{trace}(S S^T)$
• Consider an equivalent linear estimator with symmetric projection matrix $\tilde{S}$ of rank m: then
$\mathrm{trace}(\tilde{S}\tilde{S}^T) = \mathrm{trace}(\tilde{S}) = \mathrm{rank}(\tilde{S}) = m$
so the average variance is $\overline{\mathrm{var}}(\hat{y}) = \sigma^2 m / n$
• Hence the effective DoF of an estimator with matrix S is $\mathrm{DoF} = \mathrm{trace}(S S^T)$
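A short sketch of the practical uses of S mentioned above, again with a polynomial dictionary for illustration. The leave-one-out formula $e_i = (y_i - \hat{y}_i)/(1 - S_{ii})$ is the standard analytic LOO identity for least-squares linear estimators:

```python
import numpy as np

def hat_matrix(Z):
    """Projection ('hat') matrix S = Z (Z'Z)^{-1} Z', so that y_hat = S y."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

def effective_dof(S):
    """Effective degrees of freedom DoF = trace(S S')."""
    return np.trace(S @ S.T)

def loo_residuals(S, y):
    """Analytic leave-one-out residuals for a linear estimator."""
    y_hat = S @ y
    return (y - y_hat) / (1.0 - np.diag(S))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 25)
Z = np.vander(x, 4, increasing=True)   # cubic polynomial dictionary, m = 4
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 25)
S = hat_matrix(Z)
print("DoF:", effective_dof(S))        # equals m for an exact projection
print("LOO MSE:", np.mean(loo_residuals(S, y) ** 2))
```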
Non-adaptive methods
• Dictionary representation $f_m(\mathbf{x}, \mathbf{w}, V) = \sum_{i=0}^{m} w_i g(\mathbf{x}, \mathbf{v}_i)$, where basis functions $g(\mathbf{x}, \mathbf{v}_i)$ depend only on x-values of the training data
• Representative methods include:
  - local polynomials (splines) from statistics, where parameters $\mathbf{v}_i$ are knot locations
  - RBF networks from neural networks, where parameters $\mathbf{v}_i$ are RBF centers and widths
Only the non-adaptive implementation of RBF will be considered.
Local polynomials and splines
• Problem setting: (univariate) data interpolation
• Global polynomials are problematic → use local low-order polynomials (splines)
• Knot location strategies: a subset of training samples, or uniformly spaced in the x-domain
[Figure: spline fit to univariate data on [0, 1]]
RBF Networks for Regression
• RBF network parameterization:
$f_m(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{m} w_j\, g\!\left(\frac{\|\mathbf{x} - \mathbf{v}_j\|}{\sigma_j}\right) + w_0$
with typically local basis functions
• Training ~ estimating
  - parameters of the basis functions (centers $\mathbf{v}_j$, widths $\sigma_j$)
  - linear weights w
  via a non-adaptive implementation (next slide) or an adaptive implementation
Non-adaptive RBF training algorithm
1. Choose the number of basis functions (centers) m.
2. Estimate centers $\mathbf{v}_j$ using x-values of training data via unsupervised training (SOM, GLA, clustering, etc.)
3. Determine width parameters $\sigma_j$ using the heuristic: for a given center $\mathbf{v}_j$,
   (a) find the distance to the closest center: $r_j = \min_k \|\mathbf{v}_k - \mathbf{v}_j\|$ for all $k \ne j$
   (b) set the width parameter $\sigma_j = \lambda r_j$, where $\lambda$ controls the degree of overlap between adjacent basis functions; typically $1 \le \lambda \le 3$
4. Estimate weights w via linear least squares (minimization of the empirical risk).
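A compact sketch of the four steps above, using plain Lloyd's k-means in place of SOM/GLA (any clustering method would do) and a Gaussian basis function; all names are illustrative:

```python
import numpy as np

def kmeans(X, m, iters=50, seed=0):
    """Plain Lloyd's algorithm, standing in for SOM/GLA/clustering (step 2)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def rbf_fit(X, y, m=10, lam=2.0, seed=0):
    """Non-adaptive RBF training: centers by clustering, widths by the
    nearest-center heuristic sigma_j = lam * r_j, weights by least squares."""
    centers = kmeans(X, m, seed=seed)
    d = np.linalg.norm(centers[:, None] - centers, axis=-1)
    np.fill_diagonal(d, np.inf)
    sigma = lam * d.min(axis=1)                      # step 3
    dist = np.linalg.norm(X[:, None] - centers, axis=-1)
    Phi = np.exp(-(dist / sigma) ** 2 / 2)           # Gaussian basis outputs
    Phi = np.hstack([np.ones((len(X), 1)), Phi])     # bias term w0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # step 4
    return centers, sigma, w

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (100, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=100)
centers, sigma, w = rbf_fit(X, y)
```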
Application Example: Predicting NAV of Domestic Mutual Funds
• Motivation
• Background on mutual funds
• Problem specification + experimental setup
• Modeling results
• Discussion
Background: pricing mutual funds
• Mutual funds trivia
• Mutual fund pricing:
  - priced once a day (after market close)
  → NAV unknown when an order is placed
• How to estimate NAV accurately?
  Approach 1: estimate the holdings of a fund (~200-400 stocks), then find NAV
  Approach 2: estimate NAV via correlations between NAV and major market indices (learning)
Problem specs and experimental setup
• Domestic fund: Fidelity OTC (FOCPX)
• Possible inputs: SP500, DJIA, NASDAQ, ENERGY SPDR
• Data encoding:
  Output ~ % daily price change in NAV
  Inputs ~ % daily price changes of market indices
• Modeling period: 2003
• Issues: modeling method? selection of input variables? experimental setup?
Experimental Design and Modeling Setup
Possible variable selection:

Mutual Fund (Y) | X1    | X2    | X3
FOCPX           | ^IXIC | -     | -
FOCPX           | ^GSPC | ^IXIC | -
FOCPX           | ^GSPC | ^IXIC | XLE

• All variables represent % daily price changes.
• Modeling method: linear regression
• Data obtained from Yahoo Finance
• Time period for modeling: 2003
Specification of Training and Test Data
Year 2003 is divided into two-month periods: months (1,2), (3,4), (5,6), (7,8), (9,10), (11,12).
Two-month training/test set-up: each model is trained on one two-month period and tested on the adjacent period → a total of 6 regression models for 2003.
Results for Fidelity OTC Fund (GSPC+IXIC)

Coefficients            | w0     | w1 (^GSPC) | w2 (^IXIC)
Average                 | -0.027 | 0.173      | 0.771
Standard deviation (SD) | 0.043  | 0.150      | 0.165

Average model: Y = -0.027 + 0.173·^GSPC + 0.771·^IXIC
→ ^IXIC is the main factor affecting FOCPX's daily price change
Prediction error: MSE (GSPC+IXIC) = 5.95%
Results for Fidelity OTC Fund (GSPC+IXIC)
[Figure: daily closing prices for 2003, NAV vs synthetic model; daily account value (range 80-140) from 1-Jan-03 to 17-Dec-03 for FOCPX and Model(GSPC+IXIC)]
Results for Fidelity OTC Fund (GSPC+IXIC+XLE)

Coefficients            | w0     | w1 (^GSPC) | w2 (^IXIC) | w3 (XLE)
Average                 | -0.029 | 0.147      | 0.784      | 0.029
Standard deviation (SD) | 0.044  | 0.215      | 0.191      | 0.061

Average model: Y = -0.029 + 0.147·^GSPC + 0.784·^IXIC + 0.029·XLE
→ ^IXIC is the main factor affecting FOCPX's daily price change
Prediction error: MSE (GSPC+IXIC+XLE) = 6.14%
Results for Fidelity OTC Fund (GSPC+IXIC+XLE)
[Figure: daily closing prices for 2003, NAV vs synthetic model; daily account value (range 80-140) from 1-Jan-03 to 17-Dec-03 for FOCPX and Model(GSPC+IXIC+XLE)]
Effect of Variable Selection
Different linear regression models for FOCPX:
• Y = -0.035 + 0.897·^IXIC
• Y = -0.027 + 0.173·^GSPC + 0.771·^IXIC
• Y = -0.029 + 0.147·^GSPC + 0.784·^IXIC + 0.029·XLE
• Y = -0.026 + 0.226·^GSPC + 0.764·^IXIC + 0.032·XLE - 0.06·^DJI
have different prediction errors (MSE):
• MSE (IXIC) = 6.44%
• MSE (GSPC + IXIC) = 5.95%
• MSE (GSPC + IXIC + XLE) = 6.14%
• MSE (GSPC + IXIC + XLE + DJIA) = 6.43%
(1) Variable selection is a form of complexity control
(2) Good selection can be performed by domain experts
Discussion
• Many funds simply mimic major indices → statistical NAV models can be used for ranking/evaluating mutual funds
• Statistical models can be used for
  - hedging risk, and
  - overcoming restrictions on trading (market timing) of domestic funds
• Since ~70% of funds under-perform their benchmark indices, index funds may be the better choice
OUTLINE
• Objectives
• Methods taxonomy
• Linear methods
• Adaptive dictionary methods
  - additive modeling and projection pursuit
  - MLP networks
  - CART and MARS
• Kernel methods and local risk minimization
• Empirical comparisons
• Combining methods
• Summary and discussion
Additive Modeling & Projection Pursuit
• Additive models have parameterization (for regression)
$f(\mathbf{x}, V) = \sum_{j=1}^{m} g_j(\mathbf{x}, \mathbf{v}_j) + w_0$
where $g_j(\mathbf{x}, \mathbf{v}_j)$ is an adaptive basis function
• Backfitting is a greedy optimization approach for estimating basis functions sequentially: basis function $g_k(\mathbf{x}, \mathbf{v}_k)$ is estimated by holding all other basis functions ($j \ne k$) fixed
• By fixing all basis functions $j \ne k$, the empirical risk (MSE) can be decomposed as
$R_{emp}(V) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(\mathbf{x}_i, V)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\Big(y_i - w_0 - \sum_{j \ne k} g_j(\mathbf{x}_i, \mathbf{v}_j)\Big) - g_k(\mathbf{x}_i, \mathbf{v}_k)\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(r_i - g_k(\mathbf{x}_i, \mathbf{v}_k)\right)^2$
• Each basis function is estimated via the iterative backfitting algorithm (until some stopping criterion is met)
• Note: the residual $r_i$ can be interpreted as the response variable for estimating the adaptive basis function $g_k(\mathbf{x}, \mathbf{v}_k)$
Backfitting Algorithm
• Consider regression estimation of a function of two variables of the form
$y = g_1(x_1) + g_2(x_2) + \text{noise}$
from training data $(x_{i1}, x_{i2}, y_i)$, $i = 1, 2, \dots, n$; for example $t(x_1, x_2) = x_1^2 + \sin(2x_2)$, $\mathbf{x} \in [0,1]^2$
• Backfitting method:
  (1) estimate $g_1(x_1)$ for fixed $g_2(x_2)$
  (2) estimate $g_2(x_2)$ for fixed $g_1(x_1)$
  and iterate the above two steps
• Estimation via minimization of the empirical risk
$R_{emp}(g_1, g_2) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - g_1(x_{i1}) - g_2(x_{i2})\right)^2$
In the first step (with $g_2$ fixed), this amounts to minimizing $\frac{1}{n}\sum_{i=1}^{n}\left(r_i - g_1(x_{i1})\right)^2$ with residuals $r_i = y_i - g_2(x_{i2})$
Backfitting Algorithm (cont'd)
• Estimation of $g_1(x_1)$ via minimization of MSE:
$R_{emp}(g_1) = \frac{1}{n}\sum_{i=1}^{n}\left(r_i - g_1(x_{i1})\right)^2 \rightarrow \min$, where $r_i = y_i - g_2(x_{i2})$
• This is a univariate regression problem of estimating $g_1(x_1)$ from n data points $(x_{i1}, r_i)$; it can be solved by smoothing (kNN regression)
• Estimation of $g_2(x_2)$ (the second step) proceeds in a similar manner, via minimization of
$R_{emp}(g_2) = \frac{1}{n}\sum_{i=1}^{n}\left(r_i - g_2(x_{i2})\right)^2$, where $r_i = y_i - g_1(x_{i1})$
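A minimal sketch of this two-variable backfitting loop, using a kNN smoother as suggested above (the target function matches the example on the previous slide; names and the centering step are illustrative choices):

```python
import numpy as np

def knn_smooth(x, r, k=7):
    """Univariate kNN regression: for each x_i, average the response r
    over the k nearest neighbors in x (the smoother inside backfitting)."""
    order = np.abs(x[:, None] - x[None, :]).argsort(axis=1)[:, :k]
    return r[order].mean(axis=1)

def backfit(x1, x2, y, n_iter=20, k=7):
    """Backfitting for the additive model y = g1(x1) + g2(x2) + noise."""
    g1 = np.zeros_like(y)
    g2 = np.zeros_like(y)
    for _ in range(n_iter):
        g1 = knn_smooth(x1, y - g2, k)   # step (1): residuals as response
        g1 -= g1.mean()                  # center to keep the fit identifiable
        g2 = knn_smooth(x2, y - g1, k)   # step (2)
    return g1, g2

rng = np.random.default_rng(3)
x1, x2 = rng.uniform(0, 1, (2, 200))
y = x1 ** 2 + np.sin(2 * x2) + rng.normal(0, 0.1, 200)
g1, g2 = backfit(x1, x2, y)
```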
Projection Pursuit regression
• Projection pursuit is an additive model:
$f(\mathbf{x}, V, W) = \sum_{j=1}^{m} g_j(\mathbf{w}_j \cdot \mathbf{x}, \mathbf{v}_j) + w_0$
where the basis functions $g_j(z, \mathbf{v}_j)$ are univariate functions (of projections $z = \mathbf{w}_j \cdot \mathbf{x}$)
• The backfitting algorithm is used to estimate iteratively
  (a) basis functions (parameters $\mathbf{v}_j$) via scatterplot smoothing
  (b) projection parameters $\mathbf{w}_j$ (via gradient descent)
EXAMPLE: estimation of a two-dimensional function via projection pursuit
(a) Projections $z_1 = \mathbf{w}_1 \cdot \mathbf{x}$ and $z_2 = \mathbf{w}_2 \cdot \mathbf{x}$ are found that minimize unexplained variance; smoothing is performed to create adaptive basis functions $g_1(z_1)$ and $g_2(z_2)$.
(b) The final model is the sum of two univariate adaptive basis functions.
[Figure: scatterplots of residuals r vs projections $z_1$ and $z_2$, with fitted smooths $g_1(z_1)$ and $g_2(z_2)$]
Multilayer Perceptrons (MLP)
• Recall MLP networks for regression:
$\hat{y} = \sum_{j=1}^{m} w_j z_j, \quad z_j = g(\mathbf{x}, \mathbf{v}_j)$
where
$g(\mathbf{x}, \mathbf{v}_i) = s\!\left(v_{i0} + \sum_{k=1}^{d} x_k v_{ik}\right) = s(\mathbf{x} \cdot \mathbf{v}_i)$
with sigmoid activation
$s(t) = \frac{1}{1 + \exp(-t)}$ or $s(t) = \tanh(t) = \frac{\exp(t) - \exp(-t)}{\exp(t) + \exp(-t)}$
• Parameters (weights) estimated via backpropagation
Gradient Descent Learning
• Recall batch vs on-line (iterative) learning:
  - algorithmic (statistical) approaches ~ batch
  - neural-network inspired methods ~ on-line
  BUT the difference is only at the implementation level (so both types of learning should yield the same generalization performance)
• Recall the ERM inductive principle (for regression):
$R_{emp}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w})) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(\mathbf{x}_i, \mathbf{w})\right)^2$
• Assume dictionary parameterization with fixed basis functions:
$\hat{y} = f(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{m} w_j g_j(\mathbf{x})$
Sequential (on-line) least squares minimization
• Training pairs $(\mathbf{x}(k), y(k))$ presented sequentially
• On-line update equations for minimizing the empirical risk (MSE) wrt parameters w (gradient descent learning):
$\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k \nabla_{\mathbf{w}} L(\mathbf{x}(k), y(k), \mathbf{w}(k))$
where the gradient is computed via the chain rule:
$\frac{\partial L(\mathbf{x}, y, \mathbf{w})}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_j} = 2(\hat{y} - y)\, g_j(\mathbf{x})$
and the learning rate $\gamma_k$ is a small positive value (decreasing with k)
On-line least-squares minimization algorithm
• Known as the delta rule (Widrow and Hoff, 1960): given initial parameter estimates w(0), update parameters during each presentation of the k-th training sample (x(k), y(k))
• Step 1: forward pass computation
$z_j(k) = g_j(\mathbf{x}(k)), \quad j = 1, \dots, m \qquad \hat{y}(k) = \sum_{j=1}^{m} w_j(k)\, z_j(k)$ (estimated output)
• Step 2: backward pass computation
  - error term (delta): $\delta(k) = \hat{y}(k) - y(k)$
  - weight update: $w_j(k+1) = w_j(k) - \gamma_k \delta(k) z_j(k), \quad j = 1, \dots, m$
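The two steps translate almost line for line into code; a minimal sketch with a fixed polynomial dictionary and a simple decreasing learning-rate schedule (the schedule $\gamma_0/k$ is an illustrative choice, not prescribed by the slide):

```python
import numpy as np

def delta_rule(Z, y, gamma0=0.1, epochs=10):
    """Widrow-Hoff delta rule: sequential LMS updates of linear weights.
    Z holds basis-function outputs z_j(k), one training sample per row."""
    n, m = Z.shape
    w = np.zeros(m)
    step = 0
    for _ in range(epochs):
        for k in range(n):
            step += 1
            gamma = gamma0 / step            # learning rate decreasing with k
            y_hat = w @ Z[k]                 # step 1: forward pass
            delta = y_hat - y[k]             # step 2: error term
            w -= gamma * delta * Z[k]        # step 2: weight update
    return w

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
Z = np.column_stack([np.ones_like(x), x, x ** 2])   # fixed basis functions
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, 200)
print(delta_rule(Z, y))
```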
Neural network interpretation of delta rule
• Forward pass: compute $\hat{y}(k)$ from inputs $1, z_1(k), \dots, z_m(k)$ with weights $w_0(k), w_1(k), \dots, w_m(k)$
• Backward pass: propagate the error $\delta(k) = \hat{y}(k) - y(k)$ and update
$\Delta w_j(k) = -\gamma_k \delta(k) z_j(k), \qquad w_j(k+1) = w_j(k) + \Delta w_j(k)$
• Biological learning analogy (Hebbian rule): the change of a synaptic weight is proportional to the product of the synapse's input x and output y, $\Delta w \sim xy$
49
Learning for a single neuron (delta rule):• Forward pass Backward pass
1 z1 k zm k
ˆ y k
w0 k w1 k
wm k
1 z1 k zm k
k ˆ y k y k
w j k k k z j k w j k 1 w j k w j k
• How to implement gradient-descent learning in a network of neurons?
Backpropagation training
• Minimization of
$R_{emp}(\mathbf{w}, V) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(\mathbf{x}_i, \mathbf{w}, V)\right)^2$
with respect to parameters (weights) w, V
• Gradient descent optimization for $k = 1, \dots, n, \dots$:
$V(k+1) = V(k) - \gamma_k\, \mathrm{grad}_V\, L(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k))$
$\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \mathrm{grad}_{\mathbf{w}}\, L(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k))$
where $L(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k)) = \frac{1}{2}\left(f(\mathbf{x}(k), \mathbf{w}, V) - y(k)\right)^2$
• Careful application of gradient descent leads to the backpropagation algorithm
Backpropagation: forward pass
For training input $\mathbf{x}(k)$, estimate the predicted output $\hat{y}(k)$:
$a_j(k) = \mathbf{x}(k) \cdot \mathbf{v}_j(k), \quad z_j(k) = g(a_j(k)), \quad j = 1, \dots, m$ (V is d × m)
$\hat{y}(k) = \sum_{j=0}^{m} w_j(k) z_j(k)$ (W is m × 1)
Backpropagation: backward pass
Update the weights by propagating the error:
$\delta_0(k) = \hat{y}(k) - y(k)$
$\delta_{1j}(k) = \delta_0(k)\, g'(a_j(k))\, w_j(k), \quad j = 1, \dots, m$
$w_j(k+1) = w_j(k) - \gamma_k \delta_0(k) z_j(k)$
$v_{ij}(k+1) = v_{ij}(k) - \gamma_k \delta_{1j}(k) x_i(k)$
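A minimal sketch of both passes for a single-hidden-layer MLP with sigmoid hidden units; the fixed learning rate, initialization scale, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_epoch(X, y, V, w, gamma=0.05):
    """One stochastic-gradient epoch following the forward/backward passes
    above. X already includes a bias column; V is d x m, and w has length
    m + 1 with w[0] the output bias."""
    for k in range(len(y)):
        a = X[k] @ V                      # forward pass: hidden activations
        z = sigmoid(a)
        z1 = np.concatenate(([1.0], z))   # prepend bias unit z_0 = 1
        y_hat = w @ z1
        d0 = y_hat - y[k]                 # output error delta_0
        d1 = d0 * w[1:] * z * (1 - z)     # hidden deltas: s'(a) = s(1 - s)
        w -= gamma * d0 * z1              # update output weights
        V -= gamma * np.outer(X[k], d1)   # update input-to-hidden weights
    return V, w

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])           # bias + single input
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 0.1, 100)
m = 10
V = rng.normal(0, 0.5, (2, m))
w = rng.normal(0, 0.5, m + 1)
for epoch in range(200):
    V, w = backprop_epoch(X, y, V, w)
```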
Details of backpropagation
• Sigmoid activation $s(t) = \frac{1}{1 + \exp(-t)}$ has a simple derivative: $s'(t) = s(t)(1 - s(t))$
• Poor behavior for large t ~ saturation
• How to avoid saturation?
  - proper initialization (small weights)
  - pre-scaling of inputs (zero mean, unit variance)
• Learning rate schedule (initial, final)
• Stopping rules, number of epochs
• Number of hidden units
Additional enhancements
• The problem: convergence may be very slow for error functionals with very different curvatures along different directions
• Solution: add a momentum term to smooth out oscillations:
$\Delta w(k+1) = -\gamma_k \delta(k) z(k) + \mu\, \Delta w(k)$
where $\Delta w(k) = w(k) - w(k-1)$ and $\mu$ is the momentum parameter
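The momentum update is a one-line change to any gradient-descent loop; a small sketch (the values of gamma and mu are illustrative):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, gamma=0.05, mu=0.9):
    """One gradient step with momentum: the new weight change mixes the
    current (negative) gradient with the previous weight change."""
    velocity = mu * velocity - gamma * grad
    return w + velocity, velocity

# usage inside a training loop (grad = dL/dw for the current sample):
w = np.zeros(3)
velocity = np.zeros_like(w)
grad = np.array([0.2, -0.1, 0.05])      # placeholder gradient
w, velocity = sgd_momentum_step(w, grad, velocity)
```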
Various forms of complexity control
• MLP topology ~ number of hidden units
• Constraints on parameters (weights) ~ weight decay
• Type of optimization algorithm (many versions of backprop, other optimization methods)
• Stopping rules
• Initial conditions (initial 'small' weights)
• So many factors make it difficult to control complexity; usually one complexity factor is varied while keeping all others fixed
Toy example: regression
• Data set: 25 samples generated using a sine-squared target function with Gaussian noise (st. deviation 0.1)
• MLP network (two hidden units) → underfitting
Toy example: regression
• Data set: 25 samples generated using a sine-squared target function with Gaussian noise (st. deviation 0.1)
• MLP network (10 hidden units) → near-optimal fit
Backpropagation for classification
• The original MLP is for regression (as introduced above)
• For classification:
  - use a sigmoid output unit
  - during training, use real values 0/1 for class labels
  - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels (as in HW2)
[Network diagram: same single-hidden-layer MLP as for regression]
Toy example: classification
• Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
• MLP classifier (two hidden units)
Toy example: classification
• MLP classifier (six hidden units) ~ near-optimal solution
MLP architectures
• Supervised learning, single output (i.e. classification, regression):
$f(\mathbf{x}, \mathbf{w}, V) = \sum_{j=1}^{m} w_j s(\mathbf{x} \cdot \mathbf{v}_j) + w_0$
• Supervised learning, multiple outputs: one function of this form for each output unit; in matrix notation
$F(\mathbf{x}, W, V) = s(\mathbf{x} V)\, W$
where W is m × k and $V = [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_m]$ is d × m
• Unsupervised learning (data compression)
NetTalk (Sejnowski and Rosenberg, 1987)
One of the first successful applications of backpropagation: http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
• Goal: learning to read (English text) aloud, i.e. learn the mapping English text → phonemes, using an MLP network
• Network inputs encode a 7-letter window (the 4th letter, in the middle, is the one to be pronounced)
• Network outputs encode the phonemes used in English
• The MLP network is trained using labeled data (both individual words and unrestricted text)
NetTalk architecture
• Input encoding: 7 × 29 = 203 units
• Hidden layer: 80 hidden units
• Output encoding: 26 units (phonemes)
MLP networks: summary
• MLP and projection pursuit models have the same mathematical parameterization but very different statistical properties:
  MLP model ~ sum of many basis functions of projections (basis functions are the same)
  PP model ~ sum of a few basis functions of projections (basis functions are adapted to data)
• Model complexity control for MLP:
  - may be tricky, as it depends on many factors (optimization method, weight initialization, network topology)
  - in practice, tune just one factor (with others fixed) using resampling
  NOTE: implementation of resampling may be tricky (with nonlinear optimization)
Regression Trees (CART)
• Minimization of empirical risk (squared error) via partitioning of the input space into regions:
$f(\mathbf{x}) = \sum_{j=1}^{m} w_j\, I(\mathbf{x} \in R_j)$
• Example of CART partitioning for a function of 2 inputs:
[Figure: the $(x_1, x_2)$ plane split into regions R1-R5 by split points $s_1, \dots, s_4$, and the corresponding binary tree with internal nodes $(x_1, s_1)$, $(x_2, s_2)$, $(x_2, s_3)$, $(x_1, s_4)$ and leaves R1-R5]
Growing a CART tree
• Recursive partitioning for estimating regions (via binary splitting)
• Initial model ~ region $R_0$ (the whole input domain) is divided into two regions $R_1$ and $R_2$
• A split is defined by one of the inputs (k) and a split point s
• Optimal values of (k, s) are chosen so that splitting a region into two daughter regions minimizes empirical risk
• Issues:
  - efficient implementation (selection of the optimal split)
  - optimal tree size ~ model selection (complexity control)
• Advantages and limitations
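A bare-bones sketch of this recursive binary splitting (exhaustive search over (k, s), leaves predicting region means; the dictionary-based tree representation and the min_leaf guard are illustrative implementation choices):

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the (k, s) split minimizing the total
    squared error of the two daughter regions."""
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        for s in np.unique(X[:, k])[:-1]:
            left, right = y[X[:, k] <= s], y[X[:, k] > s]
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (k, s, err)
    return best

def grow_tree(X, y, min_leaf=10):
    """Recursive partitioning; each leaf predicts the region mean w_j."""
    if len(y) < 2 * min_leaf or np.all(y == y[0]):
        return {"leaf": True, "w": y.mean()}
    k, s, _ = best_split(X, y)
    if k is None:
        return {"leaf": True, "w": y.mean()}
    mask = X[:, k] <= s
    if mask.sum() < min_leaf or (~mask).sum() < min_leaf:
        return {"leaf": True, "w": y.mean()}
    return {"leaf": False, "k": k, "s": s,
            "left": grow_tree(X[mask], y[mask], min_leaf),
            "right": grow_tree(X[~mask], y[~mask], min_leaf)}

def predict(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["k"]] <= node["s"] else node["right"]
    return node["w"]
```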
CART model selection
• Model selection strategy:
  (1) grow a large tree (subject to a minimum leaf node size)
  (2) prune the tree by selectively merging tree nodes
• The final model minimizes the penalized risk
$R_{pen}(\omega, \lambda) = R_{emp}(\omega) + \lambda\, |T|$
where the empirical risk ~ MSE, $|T|$ ~ the number of leaf nodes, and $\lambda$ ~ the regularization parameter
• Note: larger $\lambda$ → smaller trees
• In practice, tree size is often user-defined (splitmin in Matlab)
Example: Boston Housing data set
• Objective: to predict the value of homes in Boston
• Data set ~ 506 samples total
  Output: value of owner-occupied homes (in $1,000's)
  Inputs: 13 variables
  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centres
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
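A sketch of the experiment on the following slides using scikit-learn, assuming the data sits in a local file housing.csv (hypothetical file and column names) with the 13 inputs above plus target column MEDV; min_samples_split plays a role similar to Matlab's splitmin parameter:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# hypothetical CSV with the 13 input columns above plus target column MEDV
data = pd.read_csv("housing.csv")
X = data.drop(columns=["MEDV"]).values
y = data["MEDV"].values

# 450 training samples, the rest held out, as on the following slides
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=450, random_state=0)

tree = DecisionTreeRegressor(min_samples_split=100, random_state=0)
tree.fit(X_tr, y_tr)
print("test R^2:", tree.score(X_te, y_te))
```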
Example CART trees for Boston Housing
1. Training set: 450 samples, splitmin = 100 (user-defined)
[Figure: resulting regression tree]
Example CART trees for Boston Housing
2. Training set: 450 samples, splitmin = 50 (user-defined)
[Figure: resulting regression tree]
Example CART trees for Boston Housing
3. Training set: 455 samples, splitmin = 100 (user-defined)
Note: the CART model is sensitive to training samples (compare vs model 1)
[Figure: resulting regression tree]
Decision Trees: summary• Advantages
- speed
- interpretability
- different types of input variables
• Limitations: sensitivity to
- correlated inputs
- affine transformations (of input variables)
- general instability of trees
• Variations: ID3 (in machine learning), linear CART
MARS
• MARS features and improvements (over CART):
  - continuous approximation (via tensor-product splines)
  - greedy selection of low-order basis functions
  - variable selection (local + global)
• MARS complexity control:
  - lack-of-fit measure based on Generalized Cross-Validation (GCV), i.e. MSE on the training set penalized by model complexity (tree size)
• MARS applicability:
  - good for high- and low-dimensional problems with a small number of low-order interactions (local or global)
  - an interaction occurs when the effect of one variable (on the output) depends on the level of another variable
MARS Basis Functions: Truncated Splines
• Basic building block: a pair of truncated linear splines hinged at knot t,
$(x - t)_+$ (the (+) part) and $(t - x)_+$ (the (-) part)
[Figure: the two mirrored truncated linear splines $b(x, t)$ hinged at knot t]
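The truncated-spline pair is a one-liner in code; a tiny sketch (the knot t = 0.5 is an arbitrary example):

```python
import numpy as np

def hinge_pos(x, t):
    """Truncated linear spline (x - t)_+ : the (+) building block."""
    return np.maximum(0.0, x - t)

def hinge_neg(x, t):
    """Mirrored truncated spline (t - x)_+ : the (-) building block."""
    return np.maximum(0.0, t - x)

# a reflected pair hinged at knot t = 0.5, as in the figure above
x = np.linspace(0, 1, 5)
print(hinge_pos(x, 0.5), hinge_neg(x, 0.5))
```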
Tensor-Product Splines
• Multivariate splines ~ tensor product of univariate truncated splines:
$g(\mathbf{x}, \mathbf{u}, \mathbf{v}) = \prod_{j=1}^{d} \left[u_j (x_j - v_j)\right]_+^{q}$
• Adaptive selection of knot locations
• Valid knot locations
[Figure: valid knot locations $v_{11}, v_{21}, v_{31}$ along $x_1$ and $v_{12}, v_{22}, v_{32}$ along $x_2$]
MARS Tree Structure
• Each node ~ active basis function
• Basis functions are estimated recursively
• On each path, variables are split at most once
• Depth of tree indicates interaction level
Example structure: $B_1(\mathbf{x}) = 1$; $B_2(\mathbf{x}) = B_1(\mathbf{x})\, b^+(x_1, t_1)$ and $B_3(\mathbf{x}) = B_1(\mathbf{x})\, b^-(x_1, t_1)$; $B_4(\mathbf{x}) = B_1(\mathbf{x})\, b^+(x_2, t_2)$ and $B_5(\mathbf{x}) = B_1(\mathbf{x})\, b^-(x_2, t_2)$; $B_6(\mathbf{x}) = B_3(\mathbf{x})\, b^+(x_3, t_3)$ and $B_7(\mathbf{x}) = B_3(\mathbf{x})\, b^-(x_3, t_3)$; the model is
$y = \sum_{i=1}^{7} a_i B_i(\mathbf{x})$
Algorithm for MARS
• Forward stepwise: search over each node to find
  - the split variable
  - the split point t
  - the coefficients (weights of basis functions)
  that minimize the lack-of-fit criterion (GCV)
• Backward stepwise: remove nodes which cause
  - a decrease of GCV, or
  - the smallest increase of GCV
• GCV criterion: penalize the empirical risk by the factor $r(p) = (1 - p)^{-2}$, i.e.
$R = r\!\left(\frac{h}{n}\right) R_{emp} = \frac{R_{emp}}{(1 - h/n)^2}$
where h is the effective degrees of freedom of the MARS model (an increasing function of the number of basis functions m)
MARS Summary
• Advantages:
  - provides variable subset selection
  - continuous approximation
  - works well for low-order interactions and additive functions
  - interpretable
• Limitations:
  - sensitive to coordinate rotation
  - problems in dealing with collinear variables
  - stability of MARS modeling for small samples
OUTLINE
• Objectives
• Methods taxonomy
• Linear methods
• Adaptive dictionary methods
• Kernel methods and local risk minimization
  - kernel methods and local risk minimization
  - Generalized Memory-Based Learning
  - Constrained Topological Mapping
• Empirical comparisons
• Combining methods
• Summary and discussion
Local Risk Minimization
• Local learning (memory-based learning): estimate a function at a single point $\mathbf{x}_0$
• Local risk minimization:
$R(\omega, \alpha; \mathbf{x}_0) = \int L(y, f(\mathbf{x}, \omega))\, \frac{K_\alpha(\mathbf{x}, \mathbf{x}_0)}{\kappa(\mathbf{x}_0)}\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy$
where $K_\alpha(\mathbf{x}, \mathbf{x}_0)$ is a local neighborhood function (kernel of width α) and $\kappa(\mathbf{x}_0) = \int K_\alpha(\mathbf{x}, \mathbf{x}_0)\, p(\mathbf{x})\, d\mathbf{x}$ is a normalizing function
• The goal is minimization of the local prediction risk over a set of functions $f(\mathbf{x}, \omega)$ and over the kernel width α, using only training data
Practical Implementation of LRM
• Simultaneous minimization of the local risk over functions $f(\mathbf{x}, \omega)$ and over the kernel width α is hard
• Practical methods assume a fixed (constant or linear) parameterization and then adjust only the kernel width
• Local estimation at a point $\mathbf{x}_0$:
  (1) Select approximating functions of fixed low complexity, and select a kernel function (i.e. Gaussian or hard threshold).
  (2) Select the optimal kernel width, providing minimum estimated local risk. That is, selectively decrease the training sample (near $\mathbf{x}_0$) to make an estimate.
LRM and Kernel Methods
• Consider minimization of the local empirical risk
$R_{emp}^{local} = \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_0)\left(y_i - f(\mathbf{x}_i, \omega)\right)^2$
• Assuming constant parameterization $f(\mathbf{x}, w_0) = w_0$, the solution is the local average
$f(\mathbf{x}_0) = w_0 = \frac{1}{n}\sum_{i=1}^{n} y_i K(\mathbf{x}_i, \mathbf{x}_0)$
(with K normalized so the weights sum to one); similarly for local linear parameterization
• The solution to LRM leads to an adaptive kernel method, because the kernel width is adapted to the data at each estimation point $\mathbf{x}_0$. However, adaptive selection of the kernel width is hard.
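A short sketch of this local-average (kernel regression) estimate with a Gaussian kernel; the fixed width is an illustrative choice standing in for the hard width-selection problem noted above:

```python
import numpy as np

def gaussian_kernel(x, x0, width):
    return np.exp(-((x - x0) ** 2) / (2 * width ** 2))

def local_average(x0, x, y, width=0.1):
    """Kernel estimate at x0: normalized weighted local average of y,
    the LRM solution for the constant parameterization f(x, w0) = w0."""
    K = gaussian_kernel(x, x0, width)
    return (K * y).sum() / K.sum()

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)
grid = np.linspace(0, 1, 11)
y_hat = np.array([local_average(x0, x, y) for x0 in grid])
```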
Practical Selection of Kernel Width
• Global adaptive approach: the kernel width is estimated globally, independent of any particular estimation point
• Global model selection for k-nn regression: for a given value of k,
  (1) compute a local estimate $\hat{y}_i$ at each training input $\mathbf{x}_i$
  (2) compute the total empirical risk of these estimates:
$R_{emp}(k) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
  (3) estimate prediction risk using an (analytic) model selection criterion
  Minimize this risk through appropriate selection of k.
Generalized Memory-Based Learning (GMBL)
• For a given new input, an output is estimated via local learning using past data
• GMBL implements locally weighted linear approximation minimizing
$R_{emp}^{local}(\mathbf{w}, w_0) = \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_0)\left(\mathbf{w} \cdot \mathbf{x}_i + w_0 - y_i\right)^2$
where the kernel
$K(\mathbf{x}, \mathbf{x}', \mathbf{v}) = \left(\sum_{k=1}^{d} v_k^2 (x_k - x'_k)^2\right)^{-q}$
has adaptable width and per-coordinate scale parameters estimated via cross-validation using all the data
Constrained Topological Mapping (CTM)
• Recall applying SOM to a regression problem
• Non-adaptive CTM approach: given training data (x, y), perform
  1. Dimensionality reduction x → z (apply SOM to x-values of training data)
  2. Kernel regression to estimate y = f(z) at discrete points in z-space
[Figure: CTM fit to univariate data on [0, 1]]
Adaptive CTM Implementation
• Batch implementation
• Local linear modeling for each CTM unit j:
$R_{emp}^{local}(\mathbf{w}_j, w_{0j}) = \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{z}_i, j)\left(\mathbf{w}_j \cdot \mathbf{x}_i + w_{0j} - y_i\right)^2$
• Variable selection via adaptive scaling of the distance to unit centers:
$\|\mathbf{c}_j - \mathbf{x}_i\|_{\mathbf{v}}^2 = \sum_{l=1}^{d} v_l^2 (c_{jl} - x_{il})^2$, where $v_l \sim \sum_{j=1}^{b} |w_{jl}|$
• Final neighborhood width (model complexity) selected via cross-validation
OUTLINE
• Objectives
• Methods taxonomy
• Linear methods
• Adaptive dictionary methods
• Kernel methods and local risk minimization
• Empirical comparisons
• Combining methods
• Summary and discussion
Empirical Comparisons
Ref: Cherkassky et al. (1996), Comparison of adaptive methods for function estimation from samples, IEEE Transactions on Neural Networks, 7, 969-984
• Challenge of comparisons:
  - who performs comparisons (experts vs general users)
  - goals of comparison
  - synthetic vs real-life data
  - importance of experimental procedure (i.e. for model selection, double resampling, etc.)
Example comparison study: time series prediction (Weigend & Gershenfeld 1992)
• Performed by experts on time series 1K-100K samples long
• Lessons learned / conclusions:
  - knowledge of the application domain is important (simplistic black-box approaches usually fail)
  - successful methods are nonlinear
  - custom/manual control of a method's parameters (model selection)
Application of Adaptive Methods
1. Choose a flexible method (parameterization)
2. Choose a complexity parameter
   - automatic (from data) or user-selected
3. Estimate the model (from training data)
4. Estimate prediction performance (on test data)
NOTE: empirical comparison (of methods) is difficult because prediction performance depends on all factors (1)-(3), in addition to the data itself
Example Comparison Study
• Objectives and assumptions:
  - non-expert users
  - public-domain s/w for the regression methods
  - manual model selection using a test set; just 1 or 2 user-defined parameters
  - off-line training (batch mode)
  - comparison focuses on methods' parameterization and model selection for different synthetic data sets
Objectives and assumptions (cont'd)
• Methods in the XTAL package:
  - k-nearest neighbors regression (k-NN)
  - Linear Regression (LR)
  - Projection Pursuit (PPR)
  - Multivariate Adaptive Regression Splines (MARS)
  - Generalized Memory-Based Learning (GMBL)
  - Constrained Topological Mapping (CTM)
  - Artificial Neural Network / backpropagation (ANN)
• Synthetic data:
  - low- and high-dimensional
  - uniformly distributed in x-space
Experimental Set-Up
• Specification of:
  - properties of synthetic data: target functions, training/test set size, x-distribution, noise level
  - performance metric: NRMS error (on the test set)
  - 4 parameter settings for each method:
    KNN: k = 2, 4, 8, 16
    GMBL: no parameters (run only once)
    CTM: smoothing parameter = 0, 2, 5, 9
    MARS: smoothing parameter = 0, 2, 5, 9
    PPR: number of terms (in the smallest model) = 1, 2, 5, 8
    ANN: number of hidden units = 5, 10, 20, 40
• Training data:
  - uniform distribution (random and spiral)
  - size: small (25), medium (100), large (400)
  - noise level: none, medium (SNR = 4), large (SNR = 2)
• Test data: 961 samples (no noise)
  - spaced uniformly on a 2D grid (for 2-dimensional data)
  - randomly sampled for high-dimensional data
[Figure: random (left) and spiral (right) x-distributions on $[-2, 2]^2$]
• 2D target functions
[Figure: surface plots of Function 1, Function 2, Function 3, Function 4]
• 2D target functions (cont'd)
[Figure: surface plots of Function 5, Function 6, Function 7, Function 8]
Comparison Summary

Criterion                            | BEST      | WORST
Prediction accuracy (dense samples)  | ANN       | KNN, GMBL
Prediction accuracy (sparse samples) | GMBL, KNN | MARS, PP
Additive target functions            | MARS, PP  | KNN, GMBL
Harmonic functions                   | CTM, ANN  | PP
Radial functions                     | ANN, PP   | KNN
Robustness wrt parameter tuning      | ANN, GMBL | PP
Robustness wrt sample properties     | ANN, GMBL | PP, MARS

• Methods' performance:
  - similar for dense (large) samples
  - uneven for sparse samples, depending significantly on the properties of the data
Comments on Specific Methods
• Comparison metrics:
  (a) generalization
  (b) robust parameter tuning
  (c) robustness to data characteristics
• kNN and GMBL:
  (a) inferior to other methods when accurate prediction is possible
  (b) very robust
  (c) very robust
Comments on Specific Methods
• MARS:
  (a) good for additive functions
  (b) somewhat brittle
  (c) rather unpredictable
• PPR:
  (a) good for additive functions and functions of linear combinations of inputs; poor for harmonic functions
  (b) brittle
  (c) rather unpredictable
Comments on Specific Methods
• ANN:
  (a) good for functions of linear combinations, harmonic and radial-type functions
  (b) very robust
  (c) very predictable
• CTM:
  (a) very good for harmonic functions; poor for functions of linear combinations
  (b) robust
  (c) predictable; best for the spiral distribution in x-space
Conclusions and Caveats
• Comparison results are always biased by
  - selection of data sets
  - s/w implementation of the adaptive methods
  - (expert) user bias
• Relative performance varies with the properties of the data sets (i.e. sample size, noise level, etc.)
• Heuristic optimization methods (ANN, CTM) are computationally intensive but often more robust than faster statistical methods
• Nonlinear methods should be robust: only for robust methods is it possible to develop automatic parameter tuning (complexity control)
OUTLINE
• Objectives
• Methods taxonomy
• Linear methods
• Adaptive dictionary methods
• Kernel methods and local risk minimization
• Empirical comparisons
• Combining methods
• Summary and discussion
Motivation for Combining Methods
• General setting (used in this course):
  - given a training data set
  - apply different learning methods
  - select the best model (method)
  Learning Method + Data → Predictive Model
• Why discard the other models?
Motivation (cont'd)
Learning Method + Data → Predictive Model
• Theoretical and empirical evidence: no single 'best' method exists
• It is always possible to find:
  - the best method for a given data set
  - the best data set for a given method
• Philosophical + statistical connections (Eastern philosophy, Bayesian averaging): combine several theories (models) explaining the data
Strategies for Combining Methods
• A predictive model depends on 3 factors:
  (a) parameterization of admissible models
  (b) random training sample
  (c) empirical loss (for risk minimization)
• Three combining strategies (for improved generalization):
  1. Different (a), the same (b) and (c) → Committee of Networks, Stacking, Bayesian averaging
  2. Different (b), the same (a) and (c) → Bagging
  3. Different (c), the same (a) and (b) → Boosting
Combining Strategy 1
• Apply N different methods (parameterizations) to the same data → N distinct models
• Form a (linear) combination of the N models
Combining Strategy 1 (cont’d)
Design issues:• What parameterizations (methods) to use?
- as different as possible
• How many component models?
• How to combine component models?
- via empirical risk minimization (neural network strategy)
- Bayesian averaging (statistical strategy)
Committee of Networks approach
Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$:
• Estimate N candidate (regression) models $f_1(\mathbf{x}, \omega_1^*), f_2(\mathbf{x}, \omega_2^*), \dots, f_N(\mathbf{x}, \omega_N^*)$ using different methods
• Construct the combined model as
$f_{com}(\mathbf{x}) = \sum_{j=1}^{N} c_j f_j(\mathbf{x}, \omega_j^*)$
where the coefficients $c_j$ are estimated via minimization of the empirical risk
$R_{com} = \frac{1}{n}\sum_{i=1}^{n}\left(f_{com}(\mathbf{x}_i) - y_i\right)^2$
under the constraints $\sum_{j=1}^{N} c_j = 1$, $c_j \ge 0$
Example of Committee Approach• Regression data set:
with x-values uniform in [0,1] and noise variance
• Regression methods used(a) polynomial(b) trigonometric(c) combined (Committee of Networks)
• Model selection:
- VC model selection for (a) and (b)- empirical risk minimization for (c)
y 0.8sin 2 x 0 .2x 2
2 0 .25
f trig x ,vm2, wm2
v j sin jx w j cos jx j 1
m2 1
w0
f comb1 x , fpoly x,um1 1 f trig x,vm2
,wm 2
f poly x ,um1 uj x
j
j 0
m1 1
[Figure: comparison on 25 training samples; red ~ target function, blue dashed ~ polynomial model, blue dotted ~ trigonometric model, black ~ combined model]
[Figure: comparison on 50 training samples; red ~ target function, blue dashed ~ polynomial model, blue dotted ~ trigonometric model, black ~ combined model]
Comparison results

25 training samples:
Model                    | MSE (model vs target)
Poly (d = 3)             | 0.0857
Trigon (d = 3)           | 0.0237
Combined (alpha = 0.5)   | 0.0239

50 training samples:
Model                    | MSE (model vs target)
Poly (d = 4)             | 0.0046
Trigon (d = 4)           | 0.0044
Combined (alpha = 0.2)   | 0.0038
Stacking approach
Given training data $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$:
• Estimate N candidate (regression) models $f_1(\mathbf{x}, \omega_1^*), f_2(\mathbf{x}, \omega_2^*), \dots, f_N(\mathbf{x}, \omega_N^*)$ using different methods
• Construct the combined model as
$f_{com}(\mathbf{x}) = \sum_{j=1}^{N} c_j f_j(\mathbf{x}, \omega_j^*)$
where the coefficients $c_j$ are estimated via resampling, under the constraints $\sum_{j=1}^{N} c_j = 1$, $c_j \ge 0$
Empirical Comparison
• The same data set and experimental setup
• Committee approach ~ Comb 1; stacking approach ~ Comb 2
[Figure: prediction risk (noise variance 0.25, n = 50) for poly, trig, comb1, comb2, and k-nn; risk values roughly in the range 0.26-0.36]
Summary and Discussion
• Linear (non-adaptive) methods for regression:
  - theoretically well understood
  - effective methods for complexity control
• Nonlinear (adaptive) methods:
  - inherently complex (non-tractable optimization)
  - difficult to apply analytic model selection and resampling
  - no single best method exists for all data sets
• Combining methods often results in better predictions