Introduction to Machine Learning (cs.kangwon.ac.kr/~leeck/ml/04-2.pdf, 2016. 6. 6.)
TRANSCRIPT
-
Regression
$r = f(x) + \varepsilon$, where $g(x \mid \theta)$ is the estimator and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, so $p(r \mid x) \sim \mathcal{N}\!\left(g(x \mid \theta), \sigma^2\right)$.

Log likelihood:
$$\mathcal{L}(\theta \mid \mathcal{X}) = \log \prod_{t=1}^{N} p(x^t, r^t) = \log \prod_{t=1}^{N} p(r^t \mid x^t) + \log \prod_{t=1}^{N} p(x^t)$$
2
-
Regression: From LogL to Error
$$\mathcal{L}(\theta \mid \mathcal{X}) = \log \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{\left(r^t - g(x^t \mid \theta)\right)^2}{2\sigma^2}\right]$$
$$= -N \log\!\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2} \sum_{t=1}^{N} \left[r^t - g(x^t \mid \theta)\right]^2$$
Maximizing the log likelihood is therefore equivalent to minimizing the error
$$E(\theta \mid \mathcal{X}) = \frac{1}{2} \sum_{t=1}^{N} \left[r^t - g(x^t \mid \theta)\right]^2$$
3
Least Square Estimates
-
Linear Regression
$$g(x^t \mid w_1, w_0) = w_1 x^t + w_0$$
Setting the partial derivatives of the error to zero gives the normal equations:
$$\sum_t r^t = N w_0 + w_1 \sum_t x^t$$
$$\sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t \left(x^t\right)^2$$
In matrix form $\mathbf{A}\mathbf{w} = \mathbf{y}$:
$$\mathbf{A} = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix}, \quad \mathbf{w} = \mathbf{A}^{-1}\mathbf{y}$$
4
Maximum likelihood, or equivalently least squares: set the partial derivatives of the error to zero (MLE)
$$E(w_1, w_0 \mid \mathcal{X}) = \frac{1}{2} \sum_{t=1}^{N} \left[r^t - \left(w_1 x^t + w_0\right)\right]^2$$
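The normal equations for the simple linear fit can be solved directly. A minimal NumPy sketch, using hypothetical toy data (the values below are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical noise-free data from r = 2x + 1, so the fit is checkable
x = np.array([0.0, 1.0, 2.0, 3.0])
r = 2.0 * x + 1.0

N = len(x)
# Normal equations in matrix form: A w = y, with w = [w0, w1]^T
A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)
print(w0, w1)  # recovers intercept 1 and slope 2
```

`np.linalg.solve` is used instead of forming `A`-inverse explicitly, which is the numerically preferred way to apply $\mathbf{w} = \mathbf{A}^{-1}\mathbf{y}$.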
-
Polynomial Regression
$$g(x^t \mid w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0$$
5
-
Polynomial Regression – cont’d
$$g(x^t \mid w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0$$
$$\mathbf{w} = \left(\mathbf{D}^{\mathsf{T}}\mathbf{D}\right)^{-1} \mathbf{D}^{\mathsf{T}} \mathbf{r}$$
where $\mathbf{D}$ is the design matrix whose row $t$ is $[1, x^t, (x^t)^2, \ldots, (x^t)^k]$.
6
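The closed form $\mathbf{w} = (\mathbf{D}^{\mathsf{T}}\mathbf{D})^{-1}\mathbf{D}^{\mathsf{T}}\mathbf{r}$ can be sketched in NumPy. The data here is a hypothetical noise-free quadratic, chosen so the recovered weights are checkable:

```python
import numpy as np

# Hypothetical data from r = 1 - 2x + 3x^2 (noise-free, for a verifiable fit)
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
r = 1.0 - 2.0 * x + 3.0 * x ** 2

k = 2  # polynomial order
# Design matrix D: row t is [1, x^t, (x^t)^2, ..., (x^t)^k]
D = np.vander(x, k + 1, increasing=True)

# w = (D^T D)^{-1} D^T r, computed via a least-squares solver for stability
w, *_ = np.linalg.lstsq(D, r, rcond=None)
print(w)  # approximately [1, -2, 3]
```

Using `np.linalg.lstsq` avoids explicitly inverting $\mathbf{D}^{\mathsf{T}}\mathbf{D}$, which becomes ill-conditioned at higher orders.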
-
Other Error Measures
Square Error:
$$E(\theta \mid \mathcal{X}) = \frac{1}{2} \sum_{t=1}^{N} \left[r^t - g(x^t \mid \theta)\right]^2$$
Relative Square Error:
$$E(\theta \mid \mathcal{X}) = \frac{\sum_{t=1}^{N} \left[r^t - g(x^t \mid \theta)\right]^2}{\sum_{t=1}^{N} \left(r^t - \bar{r}\right)^2}$$
Absolute Error:
$$E(\theta \mid \mathcal{X}) = \sum_t \left|r^t - g(x^t \mid \theta)\right|$$
$\epsilon$-sensitive Error:
$$E(\theta \mid \mathcal{X}) = \sum_t \mathbf{1}\!\left(\left|r^t - g(x^t \mid \theta)\right| > \epsilon\right) \left(\left|r^t - g(x^t \mid \theta)\right| - \epsilon\right)$$
7
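The four error measures can be computed directly from targets and predictions. `error_measures` and the toy arrays below are illustrative names and values, not from the slides:

```python
import numpy as np

def error_measures(r, g, eps=0.5):
    """Return (square, relative square, absolute, eps-sensitive) errors
    for targets r and predictions g, as defined on the slide."""
    resid = r - g
    square = 0.5 * np.sum(resid ** 2)
    relative = np.sum(resid ** 2) / np.sum((r - r.mean()) ** 2)
    absolute = np.sum(np.abs(resid))
    mask = np.abs(resid) > eps                    # indicator 1(|r - g| > eps)
    eps_sensitive = np.sum((np.abs(resid) - eps)[mask])
    return square, relative, absolute, eps_sensitive

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.2, 2.0, 2.2])
print(error_measures(r, g))  # (0.34, 0.34, 1.0, 0.3)
```

Note that the $\epsilon$-sensitive error ignores small residuals entirely, which is the same loss used by support vector regression.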
-
Bias and Variance
8
$$E\left[\left(r - g(x)\right)^2 \mid x\right] = \underbrace{E\left[\left(r - E[r \mid x]\right)^2 \mid x\right]}_{\text{noise}} + \underbrace{\left(E[r \mid x] - g(x)\right)^2}_{\text{squared error}}$$
$$E_{\mathcal{X}}\left[\left(E[r \mid x] - g(x)\right)^2 \mid x\right] = \underbrace{\left(E[r \mid x] - E_{\mathcal{X}}[g(x)]\right)^2}_{\text{bias}^2} + \underbrace{E_{\mathcal{X}}\left[\left(g(x) - E_{\mathcal{X}}[g(x)]\right)^2\right]}_{\text{variance}}$$
Estimator $d_i = d(\mathcal{X}_i)$ on sample $\mathcal{X}_i$:
$$\text{Mean square error} = E\left[(d - \theta)^2\right] = \left(E[d] - \theta\right)^2 + E\left[\left(d - E[d]\right)^2\right] = \text{Bias}^2 + \text{Variance}$$
-
Estimating Bias and Variance
$M$ samples $\mathcal{X}_i = \{x^t_i, r^t_i\}$, $i = 1, \ldots, M$, are used to fit $g_i(x)$, $i = 1, \ldots, M$:
$$\bar{g}(x) = \frac{1}{M} \sum_{i=1}^{M} g_i(x)$$
$$\text{Bias}^2(g) = \frac{1}{N} \sum_t \left[\bar{g}(x^t) - f(x^t)\right]^2$$
$$\text{Variance}(g) = \frac{1}{N M} \sum_t \sum_i \left[g_i(x^t) - \bar{g}(x^t)\right]^2$$
9
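The estimation procedure above can be sketched as follows. The choices here are illustrative assumptions: a known $f(x) = \sin x$, Gaussian noise, and a deliberately too-simple linear model class, so the bias term dominates:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(x)   # underlying function, assumed known to evaluate the bias

x = np.linspace(0.0, np.pi, 20)   # fixed evaluation points x^t
M, N = 200, len(x)

# Fit g_i on M noisy samples; the model class is a degree-1 polynomial (a line)
g = np.empty((M, N))
for i in range(M):
    r_i = f(x) + rng.normal(0.0, 0.2, size=N)   # sample X_i = {x^t, r^t_i}
    w = np.polyfit(x, r_i, deg=1)
    g[i] = np.polyval(w, x)

g_bar = g.mean(axis=0)                 # average model: (1/M) sum_i g_i(x)
bias2 = np.mean((g_bar - f(x)) ** 2)   # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((g - g_bar) ** 2)   # (1/(N M)) sum_t sum_i [g_i - g_bar]^2
print(bias2, variance)
```

Because a line cannot represent the sine's curvature, the estimated bias² is large while the variance (averaging over the 200 fits) stays small, matching the dilemma discussed next.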
-
Bias/Variance Dilemma Example:
gi(x) = 2 has no variance and high bias
gi(x) = ∑t rti / N has lower bias with higher variance
As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)
Bias/Variance dilemma: (Geman et al., 1992)
10
-
11
[Figure: the target $f$, the fitted functions $g_i$, and their average $\bar{g}$, illustrating bias and variance]
-
Polynomial Regression
12
[Figure: polynomial fits of increasing order; the best fit minimizes error, with low orders underfitting and high orders overfitting]
-
13
[Figure: error versus model complexity; the best fit is at the "elbow" of the error curve]
-
Model Selection Cross-validation: Measure generalization accuracy by
testing on data unused during training
Regularization: Penalize complex models
E’= error on data + λ model complexity
Akaike's information criterion (AIC): $\mathrm{AIC} = \log L - k$
Bayesian information criterion (BIC): $\mathrm{BIC} = \log L - \frac{k}{2} \log N$
Minimum description length (MDL): shortest description of data
Structural risk minimization (SRM):
14
h is the VC dimension of a model
k: #parameters L: Likelihood
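A sketch of AIC/BIC model selection on hypothetical data, assuming the common conventions $\mathrm{AIC} = \log L - k$ and $\mathrm{BIC} = \log L - \frac{k}{2}\log N$ with $k$ counting all free parameters (the polynomial weights plus the noise variance):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = np.linspace(-1.0, 1.0, N)
r = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=N)   # data from an order-1 model

scores = {}
for k in range(4):                                  # candidate polynomial orders 0..3
    w = np.polyfit(x, r, deg=k)
    resid = r - np.polyval(w, x)
    sigma2 = np.mean(resid ** 2)                    # ML estimate of noise variance
    # Gaussian log likelihood at the ML estimates
    logL = -0.5 * N * (np.log(2.0 * np.pi * sigma2) + 1.0)
    p = k + 2                                       # k+1 weights plus sigma^2
    scores[k] = {"AIC": logL - p, "BIC": logL - 0.5 * p * np.log(N)}
    print(k, scores[k])
```

Raising the order beyond 1 barely improves the likelihood, so the complexity penalty makes both criteria favor the simpler models, which is exactly the regularization effect described above.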
-
Bayesian Model Selection
$$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$$
15
Prior on models, p(model)
Regularization, when prior favors simpler models
Bayes, MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles: Chapter 17)
-
Regression example
16
Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
Regularization:
$$E(\mathbf{w} \mid \mathcal{X}) = \frac{1}{2} \sum_{t=1}^{N} \left[r^t - g(x^t \mid \mathbf{w})\right]^2 + \lambda \sum_i w_i^2$$
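The regularized error above has a closed-form minimizer: differentiating with respect to $\mathbf{w}$ and setting the gradient to zero gives $(\mathbf{D}^{\mathsf{T}}\mathbf{D} + 2\lambda \mathbf{I})\,\mathbf{w} = \mathbf{D}^{\mathsf{T}}\mathbf{r}$. A sketch on hypothetical data, showing how larger $\lambda$ shrinks the coefficients of a deliberately high-order polynomial:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 20)
r = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, size=20)

k = 7                                      # deliberately high polynomial order
D = np.vander(x, k + 1, increasing=True)   # design matrix, rows [1, x, ..., x^k]

max_abs = {}
for lam in (0.0, 1e-3, 1.0):
    # Closed-form ridge solution: w = (D^T D + 2*lam*I)^{-1} D^T r
    w = np.linalg.solve(D.T @ D + 2.0 * lam * np.eye(k + 1), D.T @ r)
    max_abs[lam] = np.abs(w).max()
    print(lam, max_abs[lam])               # larger lambda -> smaller weights
```

This mirrors the coefficient table on the slide: the unregularized high-order fit has large weights, and the penalty $\lambda \sum_i w_i^2$ pulls them back toward zero.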