1
Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 2
Basic Learning Approaches and Complexity Control
2
OUTLINE
2.0 Objectives
2.1 Terminology and Basic Learning Problems
2.2 Basic Learning Approaches
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary
3
2.0 Objectives
1. To quantify the notions of explanation, prediction and model
2. To introduce terminology
3. To describe basic learning methods
• Past observations ~ data points
• Explanation (model) ~ function
• Learning ~ function estimation
• Prediction ~ using the estimated model to make predictions
4
2.0 Objectives (cont’d)
• Example: classification, given training samples and an estimated model
Goal 1: explanation of training data
Goal 2: generalization (for future data)
• Learning is ill-posed
5
Learning as Induction
Induction ~ function estimation from data:
Deduction ~ prediction for new inputs:
6
2.1 Terminology and Learning Problems
• Input and output variables
[Figure: a System maps inputs x (and unobserved factors z) to output y; scatter plots of observed (x, y) samples]
• Learning ~ estimation of F(x): X → y
• Statistical dependency vs causality
7
2.1.1 Types of Input and Output Variables
• Real-valued
• Categorical (class labels)
• Ordinal (or fuzzy) variables
• Aside: fuzzy sets and fuzzy logic
[Figure: fuzzy membership functions LIGHT, MEDIUM, HEAVY plotted vs Weight (lbs), 75–225]
8
Data Preprocessing and Scaling
• Preprocessing is required with observational data
(step 4 in the general experimental procedure)
Examples: ….
• Basic preprocessing includes
- summary univariate statistics: mean, standard deviation, min and max values, range, boxplot, computed independently for each input/output variable
- detection (and removal) of outliers
- scaling of input/output variables (may be required by some learning algorithms)
• Visual inspection of data is tedious but useful
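The basic preprocessing steps listed above can be sketched in a few lines. This is a minimal illustration (not from the slides): the variable names and the tiny sample of body weights are made up for the example; real preprocessing would also handle outlier screening before scaling.

```python
import numpy as np

def summarize(x):
    """Summary univariate statistics, computed independently per variable."""
    return {"mean": float(np.mean(x)), "std": float(np.std(x)),
            "min": float(np.min(x)), "max": float(np.max(x)),
            "range": float(np.max(x) - np.min(x))}

def scale_01(x):
    """Min-max scaling of one variable to the [0, 1] range."""
    lo, hi = np.min(x), np.max(x)
    return (x - lo) / (hi - lo)

# A few body weights (kg) from the animal data set, as a toy sample.
body = np.array([1.35, 465.0, 36.33, 27.66, 1.04])
print(summarize(body))
print(scale_01(body))
```

Note how a single extreme value (the cow, 465 kg) compresses all other scaled values toward 0, which is exactly why outlier removal precedes scaling on the following slides.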
9
Example Data Set: animal body & brain weight

 #  Animal            Body (kg)   Brain (g)
 1  Mountain beaver       1.350       8.100
 2  Cow                 465.000     423.000
 3  Gray wolf            36.330     119.500
 4  Goat                 27.660     115.000
 5  Guinea pig            1.040       5.500
 6  Diplodocus        11700.000      50.000
 7  Asian elephant     2547.000    4603.000
 8  Donkey              187.100     419.000
 9  Horse               521.000     655.000
10  Potar monkey         10.000     115.000
11  Cat                   3.300      25.600
12  Giraffe             529.000     680.000
13  Gorilla             207.000     406.000
14  Human                62.000    1320.000
10
Example Data Set (cont’d)

 #  Animal            Body (kg)   Brain (g)
15  African elephant   6654.000    5712.000
16  Triceratops        9400.000      70.000
17  Rhesus monkey         6.800     179.000
18  Kangaroo             35.000      56.000
19  Hamster               0.120       1.000
20  Mouse                 0.023       0.400
21  Rabbit                2.500      12.100
22  Sheep                55.500     175.000
23  Jaguar              100.000     157.000
24  Chimpanzee           52.160     440.000
25  Brachiosaurus     87000.000     154.500
26  Rat                   0.280       1.900
27  Mole                  0.122       3.000
28  Pig                 192.000     180.000
11
Original Unscaled Animal Data: what points are outliers?
12
Animal Data: with outliers removed and scaled to the [0, 1] range; humans appear in the top left corner
[Figure: scatter plot of scaled Body weight (x-axis) vs Brain weight (y-axis), both in [0, 1]]
13
2.1.2 Supervised Learning: Regression
• Data in the form (x, y), where
- x is a multivariate input (i.e. a vector)
- y is a univariate output (‘response’)
• Regression: y is real-valued
Estimation of a real-valued function y = f(x)
[Figure: noisy training samples and an estimated regression curve, x in [0, 1]]
14
2.1.2 Supervised Learning: Classification
• Data in the form (x, y), where
- x is a multivariate input (i.e. a vector)
- y is a univariate output (‘response’)
• Classification: y is categorical (a class label)
Estimation of an indicator function y = f(x)
15
2.1.2 Unsupervised Learning
• Data in the form (x), where
- x is multivariate input (i.e. vector)
• Goal 1: data reduction or clustering
Clustering = estimation of a mapping X → c (cluster index)
16
Unsupervised Learning (cont’d)
• Goal 2: dimensionality reduction
Finding low-dimensional model of the data
17
2.1.3 Other (nonstandard) learning problems
• Multiple model estimation:
18
OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Basic Learning Approaches
- Parametric Modeling
- Non-parametric Modeling
- Data Reduction
2.3 Generalization and Complexity Control
2.4 Application Example
2.5 Summary
19
2.2.1 Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
• Example: linear regression f(x) = (w · x) + b
Estimate (w, b) by minimizing the fitting error:
    Σ_{i=1..n} (y_i − (w · x_i) − b)² → min
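Steps (1) and (2) for the linear-regression example can be sketched as follows. This is a minimal illustration, not code from the lecture: the synthetic data, the true coefficients, and the noise level are all made up; the least-squares fit itself is done with numpy's standard solver.

```python
import numpy as np

# Hypothetical training data (x_i, y_i), i = 1..n, with a known linear target.
rng = np.random.default_rng(0)
n = 20
X = rng.uniform(0, 1, size=(n, 2))                    # multivariate inputs x_i
y = X @ np.array([1.5, -2.0]) + 0.3 + 0.05 * rng.standard_normal(n)

# Step (1): parametric model f(x) = (w . x) + b.
# Step (2): append a column of ones so b is estimated together with w,
# then minimize the sum of squared residuals (least squares).
A = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

f = lambda x: x @ w + b                               # the estimated model
print(w, b)
```

With low noise the estimated (w, b) lands close to the generating values, which is the "explanation of training data" goal; whether it also generalizes is the subject of Section 2.3.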
20
Parametric Modeling (cont’d)
Given training data (x_i, y_i), i = 1, 2, ..., n
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
Univariate classification:
21
2.2.2 Non-Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n, estimate the model (for a given x₀) as a ‘local average’ of the data.
Note: need to define ‘local’ and ‘average’
• Example: k-nearest-neighbors regression
    f(x₀) = (1/k) Σ_{j=1..k} y_j
where the sum runs over the k nearest neighbors of x₀.
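The local-average formula above can be sketched directly. This is a minimal 1-D illustration (not from the slides); the quadratic toy data and the function name are made up for the example.

```python
import numpy as np

def knn_regress(x0, X, y, k=4):
    """k-nearest-neighbors regression: a 'local average' of the k closest samples."""
    d = np.abs(X - x0)            # distances to x0 (1-D inputs here)
    nearest = np.argsort(d)[:k]   # indices of the k nearest neighbors of x0
    return y[nearest].mean()      # f(x0) = (1/k) * sum of their y-values

X = np.linspace(0, 1, 10)
y = X ** 2
print(knn_regress(0.5, X, y, k=4))
```

Here ‘local’ is defined by Euclidean distance and ‘average’ by the arithmetic mean; the choice of k controls how local the estimate is, which is exactly the complexity index discussed in Section 2.3.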
22
2.2.3 Data Reduction Approach
Given training data, estimate the model as a ‘compact encoding’ of the data.
Note: ‘compact’ ~ # of bits needed to encode the model
• Example: piece-wise linear regression
How many parameters are needed for a two-linear-component model?
23
Example: piece-wise linear regression vs linear regression
[Figure: piece-wise linear fit vs a single linear fit to the same training data, x in [0, 1]]
24
Data Reduction Approach (cont’d)
Data reduction approaches are commonly used for unsupervised learning tasks.
• Example: clustering.
Training data encoded by 3 points (cluster centers)
Issues:
- How to find the centers?
- How to select the number of clusters?
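The first issue, finding the centers, is commonly addressed with Lloyd's k-means iteration; the slides do not name a specific algorithm, so this is one standard sketch under that assumption. The six toy points and the hand-picked initial centers are made up so the run is deterministic.

```python
import numpy as np

def kmeans(X, centers, iters=20):
    """Plain Lloyd's algorithm: encode the data by a few cluster centers."""
    centers = centers.copy()
    for _ in range(iters):
        # Assign each point to its nearest center ...
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ... then move each center to the mean of its assigned points.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Six 1-D points forming three obvious pairs; initial centers = one point per pair.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
centers, labels = kmeans(X, centers=X[[0, 2, 4]])
print(centers.ravel())   # converges to [0.05, 1.05, 2.05]
```

The second issue, choosing the number of clusters, is not solved by the algorithm itself; it is a complexity-control question, taken up in Section 2.3.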
25
Inductive Learning Setting
• Induction and deduction in philosophy:
All observed swans are white (data samples).
Therefore, all swans are white.
• Model estimation ~ inductive step, i.e. estimating a function from data samples
• Prediction ~ deductive step
• Discussion: which of the 3 modeling approaches follow inductive learning?
• Do humans implement inductive inference?
26
OUTLINE
2.0 Objectives
2.1 Terminology and Learning Problems
2.2 Modeling Approaches & Learning Methods
2.3 Generalization and Complexity Control
- Prediction Accuracy (generalization)
- Complexity Control: examples
- Resampling
2.4 Application Example
2.5 Summary
27
2.3.1 Prediction Accuracy
Inductive learning ~ function estimation
• All modeling approaches implement ‘data fitting’ ~ explaining the data
• BUT the true goal ~ prediction
• Two possible goals of learning:
- estimation of the ‘true function’
- good generalization for future data
• Are these two goals equivalent? If not, which one is more practical?
28
Explanation vs Prediction
(a) Classification (b) Regression
29
Inductive Learning Setting
• The learning machine observes samples (x, y) and returns an estimated response ŷ = f(x, w)
• Recall ‘first-principles’ vs ‘empirical’ knowledge
• Two modes of inference: identification vs imitation
• Risk:  R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min
30
Discussion
• The math formulation is useful for quantifying
- explanation ~ fitting error (on training data)
- generalization ~ prediction error
• Natural assumptions:
- the future is similar to the past: stationary P(x, y), i.i.d. data
- a known discrepancy measure, or loss function, e.g. MSE
• What if these assumptions do not hold?
31
Example: Regression
Given: training data (x_i, y_i), i = 1, 2, ..., n
Find a function f(x, w) that minimizes the squared error over a large number (N) of future samples:
    Σ_{k=1..N} (y_k − f(x_k, w))² → min
i.e. minimizes the risk  ∫ (y − f(x, w))² dP(x, y) → min
BUT future data are unknown ~ P(x, y) is unknown
[Figure: noisy training samples and the target function, x in [0, 1]]
32
2.3.2 Complexity Control: parametric modeling
Consider regression estimation:
• Ten training samples from y = x² + ξ, ξ ~ N(0, σ²), with σ² = 0.25
• Fitting linear and 2nd-order polynomial models:
33
Complexity Control: local estimation
Consider regression estimation:
• Ten training samples from y = x² + ξ, ξ ~ N(0, σ²), with σ² = 0.25
• Using k-nn regression with k = 1 and k = 4:
34
Complexity Control (cont’d)
• Complexity (of the set of admissible models) affects generalization (for future data)
• Specific complexity indices:
- parametric models: ~ # of parameters
- local modeling: size of the local region
- data reduction: # of clusters
• Complexity control = choosing good complexity (~ good generalization) for the given (training) data
35
How to Control Complexity?
• Two approaches: analytic and resampling
• Analytic criteria estimate the prediction error as a function of the fitting error and the model complexity. For regression problems:
    R_est = r(p, n) · R_emp,   where p = DoF/n, n ~ sample size, DoF ~ degrees of freedom
Representative analytic criteria for regression:
• Schwartz criterion:  r(p, n) = 1 + (ln n / 2) · p / (1 − p)
• Akaike’s FPE:        r(p) = (1 + p) / (1 − p)
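The two penalization factors can be compared numerically. A small sketch, with the formulas as reconstructed from the slide's garbled notation (the exact Schwartz form should be checked against the course text) and made-up fitting errors:

```python
import numpy as np

def fpe(r_emp, dof, n):
    """Akaike's FPE: penalized risk R_emp * (1 + p) / (1 - p), p = DoF/n."""
    p = dof / n
    return r_emp * (1 + p) / (1 - p)

def schwartz(r_emp, dof, n):
    """Schwartz criterion, as reconstructed on the slide:
    R_emp * (1 + (ln n / 2) * p / (1 - p))."""
    p = dof / n
    return r_emp * (1 + 0.5 * np.log(n) * p / (1 - p))

# A more complex model fits slightly better (R_emp 0.09 vs 0.10) but its
# penalized estimate is worse, so the simpler model is selected.
print(fpe(0.10, dof=3, n=30), fpe(0.09, dof=9, n=30))
```

This illustrates the point of analytic complexity control: the penalty grows with DoF/n, so a small improvement in fitting error does not justify a much more complex model.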
36
2.3.3 Resampling
• Split the available data into two sets: training + validation
(1) Use the training set for model estimation (via data fitting)
(2) Use the validation data to estimate the prediction error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error
BUT the results are sensitive to the data splitting
37
K-fold cross-validation
1. Divide the training data Z into k randomly selected disjoint subsets {Z₁, Z₂, ..., Z_k} of size n/k
2. For each ‘left-out’ validation set Z_i:
- use the remaining data to estimate the model ŷ = f_i(x)
- estimate the prediction error on Z_i:
    r_i = (k/n) Σ_{(x, y) ∈ Z_i} (f_i(x) − y)²
3. Estimate the average prediction risk as
    R_cv = (1/k) Σ_{i=1..k} r_i
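Steps 1–3 above can be sketched directly. A minimal illustration (not from the slides): the synthetic data, the two candidate models (a constant and a line), and the helper names are made up for the example.

```python
import numpy as np

def kfold_cv_risk(X, y, fit, k=5, seed=0):
    """Estimate prediction risk R_cv by k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)                # k disjoint subsets of size ~n/k
    risks = []
    for i in range(k):
        trn = np.hstack([folds[j] for j in range(k) if j != i])
        f = fit(X[trn], y[trn])                   # estimate model on remaining data
        risks.append(np.mean((f(X[folds[i]]) - y[folds[i]]) ** 2))
    return float(np.mean(risks))                  # average of the k fold risks

# Toy comparison: constant model vs linear model on noisy linear data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 40)
y = 2 * X + 0.1 * rng.standard_normal(40)

def fit_const(Xt, yt):
    m = yt.mean()
    return lambda x: np.full_like(x, m)

def fit_linear(Xt, yt):
    w, b = np.polyfit(Xt, yt, 1)
    return lambda x: w * x + b

print(kfold_cv_risk(X, y, fit_const), kfold_cv_risk(X, y, fit_linear))
```

The model with the lower estimated R_cv (here, the linear one) is the one cross-validation would select.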
38
Example of model selection (1)
• 25 samples are generated as y = sin²(2πx) + ξ, with x uniformly sampled in [0, 1] and noise ξ ~ N(0, 1)
• Regression is estimated using polynomials of degree m = 1, 2, ..., 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation. The curve shows the selected polynomial model, along with training (*) and validation (*) data points, for one partitioning.

 m   Estimated R via cross-validation
 1   0.1340
 2   0.1356
 3   0.1452
 4   0.1286
 5   0.0699
 6   0.1130
 7   0.1892
 8   0.3528
 9   0.3596
10   0.4006
39
Example of model selection (2)
• Same data set, but estimated using k-nn regression
• The optimal value k = 7 is chosen via 5-fold cross-validation. The curve shows the k-nn model, along with training (*) and validation (*) data points, for one partitioning.
k Estimated R via Cross validation
1 0.1109
2 0.0926
3 0.0950
4 0.1035
5 0.1049
6 0.0874
7 0.0831
8 0.0954
9 0.1120
10 0.1227
40
More on Resampling
• Leave-one-out (LOO) cross-validation
- extreme case of k-fold with k = n (# of samples)
- efficient use of data, but requires n model estimates
• The final (selected) model depends on:
- the random data
- the random partitioning of the data into k subsets (folds)
so the same resampling procedure may yield different model selection results
• Some applications may use non-random splitting of the data into (training + validation)
• Model selection via resampling is based on the estimated prediction risk (error).
• Does this estimated error reflect the true prediction accuracy of the final model?
41
Resampling for estimating true risk
• The prediction risk (test error) of a method can also be estimated via resampling
• Partition the data into training / validation / test sets
• Test data should never be used for model estimation
• Double resampling method:
- resampling for complexity control
- resampling for estimating the prediction performance of a method
• Estimation of the prediction risk (test error) is critical for comparing different learning methods
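The double-resampling idea can be sketched as a nested loop: an outer loop holds out a test fold, and an inner cross-validation selects the model complexity using only the remaining data. A minimal illustration (not from the slides): the candidate models here are polynomials of a few degrees on made-up sinusoidal data, and all helper names are invented for the example.

```python
import numpy as np

def cv_risk(X, y, fit, k=6, seed=0):
    """Inner k-fold CV risk, used only for model (complexity) selection."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    risks = []
    for i in range(k):
        trn = np.hstack([folds[j] for j in range(k) if j != i])
        f = fit(X[trn], y[trn])
        risks.append(np.mean((f(X[folds[i]]) - y[folds[i]]) ** 2))
    return float(np.mean(risks))

def double_resampling(X, y, fits, outer_k=5, seed=0):
    """Outer loop estimates test error; inner CV picks the model using
    only the training portion, so each test fold stays unseen."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, outer_k)
    errors = []
    for i in range(outer_k):
        trn = np.hstack([folds[j] for j in range(outer_k) if j != i])
        best = min(fits, key=lambda fit: cv_risk(X[trn], y[trn], fit))
        f = best(X[trn], y[trn])
        errors.append(np.mean((f(X[folds[i]]) - y[folds[i]]) ** 2))
    return float(np.mean(errors))

def make_poly_fit(deg):
    def fit(Xt, yt):
        c = np.polyfit(Xt, yt, deg)
        return lambda x: np.polyval(c, x)
    return fit

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * X) + 0.2 * rng.standard_normal(60)
test_err = double_resampling(X, y, [make_poly_fit(d) for d in (1, 3, 5)])
print(test_err)
```

As on the following slides, the selected complexity can differ from fold to fold; only the outer-loop average is an honest estimate of the method's test error.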
42
Example of model selection for k-NN classifier via 6-fold x-validation: Ripley’s data. Optimal decision boundary for k=14
[Figure: Ripley’s data with the k-NN decision boundary for k=14]
43
Example of model selection for k-NN classifier via 6-fold x-validation: Ripley’s data. Optimal decision boundary for k=50
Which one is better, k=14 or k=50?
[Figure: Ripley’s data with the k-NN decision boundary for k=50]
44
Estimating test error of a method
• For the same example (Ripley’s data), what is the true test error of the k-NN method?
• Use double resampling, i.e. 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

Fold #    k    Validation error   Test error
  1      20    11.76%             14%
  2       9     0%                 8%
  3       1    17.65%             10%
  4      12     5.88%             18%
  5       7    17.65%             14%
 mean          10.59%             12.8%

• Note: the optimal k-values differ and the errors vary for each fold, due to the high variability of the random partitioning of the data
45
Estimating test error of a method (cont’d)
• Another realization of double resampling, i.e. 5-fold cross-validation to estimate the test error and 6-fold cross-validation to estimate the optimal k for each training fold:

Fold #    k    Validation error   Test error
  1       7    14.71%             14%
  2      31     8.82%             14%
  3      25    11.76%             10%
  4       1    14.71%             18%
  5      62    11.76%              4%
 mean          12.35%             12%

• Note: the predicted average test error (12%) is usually higher than the minimized validation error (11%) used for model selection
46
2.4 Application Example
• Why financial applications?
- “the market is always right” ~ loss function
- lots of historical data
- modeling results are easy to understand
• Background on mutual funds
• Problem specification + experimental setup
• Modeling results
• Discussion
48
2.4.1 Background: pricing mutual funds
• Mutual funds trivia and recent scandals
• Mutual fund pricing:
- priced once a day (after market close)
- NAV is unknown when the order is placed
• How to estimate NAV accurately?
Approach 1: estimate the holdings of a fund (~200-400 stocks), then compute NAV
Approach 2: estimate NAV via correlations between NAV and major market indices (learning)
49
2.4.2 Problem specs and experimental setup
• Domestic fund: Fidelity OTC (FOCPX)
• Possible inputs: SP500, DJIA, NASDAQ, ENERGY SPDR
• Data encoding:
Output ~ % daily price change in NAV
Inputs ~ % daily price changes of market indices
• Modeling period: 2003
• Issues: choice of modeling method? selection of input variables? experimental setup?
50
Experimental Design and Modeling Setup
Possible variable selection (all variables represent % daily price changes):

Mutual Fund (Y)   X1      X2      X3
FOCPX             ^IXIC   -       -
FOCPX             ^GSPC   ^IXIC   -
FOCPX             ^GSPC   ^IXIC   XLE

• Modeling method: linear regression
• Data obtained from Yahoo Finance
• Time period for modeling: 2003
51
Specification of Training and Test Data
Year 2003 is divided into two-month periods: (1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12).
Each two-month training period is followed by a two-month test period, sliding through the year (two-month training/test set-up).
Total: 6 regression models for 2003.
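The sliding set-up above can be enumerated in a few lines. The slide counts 6 regression models; one reading is six two-month training windows, of which the five interior ones have a following test window. This sketch lists the sliding train/test pairs under that assumption (the pairing convention is an assumption, not stated on the slide).

```python
# Two-month periods of 2003; each training pair is tested on the following pair.
pairs = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
splits = [(pairs[i], pairs[i + 1]) for i in range(len(pairs) - 1)]
for train, test in splits:
    print("train on months", train, "-> test on months", test)
```

This kind of non-random, time-ordered splitting is the natural choice for financial data, where future data must never leak into training (compare the remark on non-random splitting in Section 2.3).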
52
Results for Fidelity OTC Fund (GSPC+IXIC)
Coefficients              w0       w1 (^GSPC)   w2 (^IXIC)
Average                  -0.027    0.173        0.771
Standard deviation (SD)   0.043    0.150        0.165

Average model: Y = -0.027 + 0.173·^GSPC + 0.771·^IXIC
^IXIC is the main factor affecting FOCPX’s daily price change.
Prediction error: MSE (GSPC+IXIC) = 5.95%
53
Results for Fidelity OTC Fund (GSPC+IXIC)
Daily closing prices for 2003: NAV vs synthetic model
[Figure: daily account value for 2003, FOCPX NAV vs Model(GSPC+IXIC), Jan 1 - Dec 17, 2003]
54
Results for Fidelity OTC Fund (GSPC+IXIC+XLE)

Coefficients              w0       w1 (^GSPC)   w2 (^IXIC)   w3 (XLE)
Average                  -0.029    0.147        0.784        0.029
Standard deviation (SD)   0.044    0.215        0.191        0.061

Average model: Y = -0.029 + 0.147·^GSPC + 0.784·^IXIC + 0.029·XLE
^IXIC is the main factor affecting FOCPX’s daily price change.
Prediction error: MSE (GSPC+IXIC+XLE) = 6.14%
55
Results for Fidelity OTC Fund (GSPC+IXIC+XLE)
Daily closing prices for 2003: NAV vs synthetic model
[Figure: daily account value for 2003, FOCPX NAV vs Model(GSPC+IXIC+XLE), Jan 1 - Dec 17, 2003]
56
Effect of Variable Selection
Different linear regression models for FOCPX:
• Y = -0.035 + 0.897·^IXIC
• Y = -0.027 + 0.173·^GSPC + 0.771·^IXIC
• Y = -0.029 + 0.147·^GSPC + 0.784·^IXIC + 0.029·XLE
• Y = -0.026 + 0.226·^GSPC + 0.764·^IXIC + 0.032·XLE − 0.06·^DJI
These models have different prediction errors (MSE):
• MSE (IXIC) = 6.44%
• MSE (GSPC + IXIC) = 5.95%
• MSE (GSPC + IXIC + XLE) = 6.14%
• MSE (GSPC + IXIC + XLE + DJIA) = 6.43%
(1) Variable selection is a form of complexity control
(2) Good selection can be performed by domain experts
57
Discussion
• Many funds simply mimic major indices, so statistical NAV models can be used for ranking/evaluating mutual funds
• Statistical models can be used for
- hedging risk, and
- overcoming restrictions on trading (market timing) of domestic funds
• Since 70% of funds under-perform their benchmark indices, index funds may be the better choice
58
Summary
• Inductive learning ~ function estimation
• Goal of learning (empirical inference): to act/perform well, not system identification
• Important concepts:
- training data, test data
- loss function, prediction error (aka risk)
- basic learning problems
- basic learning methods
• Complexity control and resampling
• Estimating prediction error via resampling