
Statistics 860 Lecture 15 © G. Wahba 2016

Multicategory Support Vector Machines

References

Y. Lee and C. K. Lee. Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data. Bioinformatics 19 (2003) 1132-1139. lee.lee.pdf

Y. Lee, Y. Lin and G. Wahba. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. JASA, March 2004. lee.lin.wahba.04.pdf

Y. Lee, G. Wahba and S. Ackerman. Classification of Satellite Radiance Data by Multicategory Support Vector Machines. JTECH, February 2004. lee.wahba.ackerman.04.pdf

lee.wahba.ackerman.corr.04.pdf


X. Lin. Smoothing Spline Analysis of Variance for Polychotomous Response Data. PhD thesis, Department of Statistics, University of Wisconsin, Madison, WI, 1998. TR 1003. xiwuth.pdf

G. Wahba. Soft and hard classification by reproducing kernel Hilbert space methods. Proc. NAS 99 (2002), 16524-16530. tr1067.pdf


♣♣ Multichotomous penalized likelihood [xiwuth.pdf].

$k + 1$ categories, $k > 1$. Let $p_j(t)$ be the probability that a subject with attribute vector $t$ is in category $j$, $\sum_{j=0}^{k} p_j(t) = 1$. From [xiwuth.pdf]: Let

$$f_j(t) = \log\, p_j(t)/p_0(t), \qquad j = 1, \cdots, k.$$

Then:

$$p_j(t) = \frac{e^{f_j(t)}}{1 + \sum_{\ell=1}^{k} e^{f_\ell(t)}}, \quad j = 1, \cdots, k, \qquad
p_0(t) = \frac{1}{1 + \sum_{\ell=1}^{k} e^{f_\ell(t)}}.$$

Coding:

$$y_i = (y_{i1}, \cdots, y_{ik}),$$

$y_{ij} = 1$ if the $i$th subject is in category $j$ and 0 otherwise.

(Also J. Zhu and T. Hastie, Biostatistics 5:427-443, 2003: $f_j(t) = (b_j, x)$ with the $\sum_{j=1}^{k} f_j(t) = 0$ constraint.)
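As a quick illustration of these formulas, here is a minimal Python sketch (not from the references; the function name is illustrative) that computes $(p_0, p_1, \cdots, p_k)$ from the logits $f_1, \cdots, f_k$:

    import numpy as np

    def category_probs(f):
        """Given f = (f_1,...,f_k), the logits relative to baseline category 0,
        return (p_0, p_1, ..., p_k) via p_j = e^{f_j} / (1 + sum_l e^{f_l})."""
        f = np.asarray(f, dtype=float)
        # (for very large f_j one would subtract max(f) first for numerical stability)
        ef = np.exp(f)
        denom = 1.0 + ef.sum()
        return np.concatenate(([1.0], ef)) / denom  # components sum to 1

    # Example with k = 3 categories beyond the baseline:
    print(category_probs([0.5, -1.0, 2.0]))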


♣♣ Multichotomous penalized likelihood (cont.).

Letting $f = (f_1, \cdots, f_k)$, the negative log likelihood can be written as

$$-\log L(y, f) = \sum_{i=1}^{n} \Big\{ -\sum_{j=1}^{k} y_{ij} f_j(t_i) + \log\Big(1 + \sum_{j=1}^{k} e^{f_j(t_i)}\Big) \Big\},$$

where

$$f_j = \sum_{\nu=1}^{M} d_{\nu j}\,\phi_\nu + h_j.$$

The penalty $\lambda\,\|h\|^2_{\mathcal{H}_K}$ becomes

$$\sum_{j=1}^{k} \lambda_j \|h_j\|^2_{\mathcal{H}_K},$$

and the optimization problem becomes: minimize

$$I_\lambda(y, f) = -\log L(y, f) + \sum_{j=1}^{k} \lambda_j \|h_j\|^2_{\mathcal{H}_K}.$$
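A minimal sketch of this objective (illustrative only; the RKHS penalty values are passed in as plain numbers rather than computed):

    import numpy as np

    def neg_log_lik(Y, F):
        """-log L(y, f) for the k+1 category logit model.
        Y: (n, k) 0/1 indicators (an all-zero row means baseline category 0).
        F: (n, k) values f_j(t_i)."""
        Y = np.asarray(Y, float)
        F = np.asarray(F, float)
        # sum_i { -sum_j y_ij f_j(t_i) + log(1 + sum_j exp(f_j(t_i))) }
        return float(np.sum(-np.sum(Y * F, axis=1)
                            + np.log(1.0 + np.exp(F).sum(axis=1))))

    def I_lambda(Y, F, lambdas, pen):
        """Penalized objective: negative log likelihood plus sum_j lambda_j * pen[j],
        where pen[j] stands in for ||h_j||^2_{H_K} (not computed here)."""
        return neg_log_lik(Y, F) + float(np.dot(lambdas, pen))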


♣♣ Multichotomous penalized likelihood (cont.).

10 year risk of mortality as a function of $t = (x_1, x_2, x_3)$ = age, glycosylated hemoglobin, and systolic blood pressure [xiwuth.pdf].

[Figure 34: Cross-section plot of probability surfaces; horizontal axis: age (40-90), vertical axis: prob (0-1). In each plot, the differences between adjacent curves (from bottom to top) are probabilities for: alive, diabetes, heart attack, other cause, respectively. The points imposed are in the same order. Older onset without taking insulin; those who died after 10 yrs from baseline are considered to be alive. n = 646.]

$x_2$ and $x_3$ set at their medians. The differences between adjacent curves (from bottom to top) are probabilities $p_j(t)$ for: 0: alive, 1: diabetes, 2: heart attack, 3: other causes.

$$f_j(x_1, x_2, x_3) = \mu_j + f_{j1}(x_1) + f_{j2}(x_2) + f_{j3}(x_3) + f_{j23}(x_2, x_3)$$

(Smoothing Spline ANOVA model.)
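A schematic of how such an SS-ANOVA decomposition is evaluated (toy component functions standing in for the fitted spline terms; all names and values below are illustrative):

    def f_j(x1, x2, x3, mu, f1, f2, f3, f23):
        """SS-ANOVA form: constant + main effects + one two-way interaction."""
        return mu + f1(x1) + f2(x2) + f3(x3) + f23(x2, x3)

    # Toy components for (age, glycosylated hemoglobin, systolic blood pressure):
    val = f_j(60.0, 7.5, 130.0,
              mu=-1.0,
              f1=lambda age: 0.03 * (age - 55),
              f2=lambda gly: 0.10 * (gly - 7),
              f3=lambda sbp: 0.005 * (sbp - 120),
              f23=lambda gly, sbp: 0.001 * (gly - 7) * (sbp - 120))
    print(val)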


♣♣ Multicategory support vector machines (MSVMs).

From [lee.lin.wahba.04.pdf], [lee.wahba.ackerman.04.pdf, ...corr.04.pdf], and earlier reports. $k > 2$ categories. Coding:

$$y_i = (y_{i1}, \cdots, y_{ik}), \qquad \sum_{j=1}^{k} y_{ij} = 0,$$

in particular $y_{ij} = 1$ if the $i$th subject is in category $j$ and $y_{ij} = -\frac{1}{k-1}$ otherwise. Thus $y_i = (1, -\frac{1}{k-1}, \cdots, -\frac{1}{k-1})$ indicates that $y_i$ is from category 1. The MSVM produces $f(t) = (f_1(t), \cdots, f_k(t))$, with each $f_j = d_j + h_j$, $h_j \in \mathcal{H}_K$, required to satisfy the sum-to-zero constraint

$$\sum_{j=1}^{k} f_j(t) = 0$$

for all $t$ in $\mathcal{T}$. The largest component of $f$ indicates the classification.
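A small sketch of this coding and of the classification rule (0-based category indices; names are illustrative):

    import numpy as np

    def msvm_code(cat, k):
        """Class code for category cat (0-based): 1 in position cat and
        -1/(k-1) elsewhere, so the components sum to zero."""
        y = np.full(k, -1.0 / (k - 1))
        y[cat] = 1.0
        return y

    def classify(f_t):
        """The largest component of f(t) = (f_1(t),...,f_k(t)) gives the class."""
        return int(np.argmax(f_t))

    print(msvm_code(0, 3))              # [ 1.  -0.5 -0.5]
    print(classify([-0.4, 0.9, -0.5]))  # 1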


♣♣ Multicategory support vector machines (MSVMs) (cont.).

Let $L_{jr} = 1$ for $j \neq r$ and 0 otherwise. The MSVM is defined as the vector of functions $f_\lambda = (f_{1,\lambda}, \cdots, f_{k,\lambda})$, with each $h_j \in \mathcal{H}_K$ and satisfying the sum-to-zero constraint, which minimizes

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{r=1}^{k} L_{\mathrm{cat}(i)\,r}\,\big(f_r(t_i) - y_{ir}\big)_+ \;+\; \lambda \sum_{j=1}^{k} \|h_j\|^2_{\mathcal{H}_K},$$

where $(\tau)_+ = \max(\tau, 0)$, or equivalently

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{r \neq \mathrm{cat}(i)} \Big(f_r(t_i) + \frac{1}{k-1}\Big)_+ \;+\; \lambda \sum_{j=1}^{k} \|h_j\|^2_{\mathcal{H}_K},$$

where $\mathrm{cat}(i)$ is the category of $y_i$.

The $k = 2$ case reduces to the usual 2-category SVM.

The target for the MSVM is $f(t) = (f_1(t), \cdots, f_k(t))$ with $f_j(t) = 1$ if $p_j(t)$ is bigger than the other $p_\ell(t)$ and $f_j(t) = -\frac{1}{k-1}$ otherwise.
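A sketch of the data-fit part of this objective in the standard case ($L_{jr} = 1$ for $j \neq r$), using the equivalent form above; the penalty term is omitted and names are illustrative:

    import numpy as np

    def msvm_data_fit(F, cats, k):
        """(1/n) sum_i sum_{r != cat(i)} ( f_r(t_i) + 1/(k-1) )_+ .
        F: (n, k) array of f_r(t_i); cats: length-n categories (0-based)."""
        F = np.asarray(F, float)
        n = F.shape[0]
        total = 0.0
        for i in range(n):
            for r in range(k):
                if r != cats[i]:
                    total += max(F[i, r] + 1.0 / (k - 1), 0.0)
        return total / n

    # Example with n = 2, k = 3:
    F = [[0.9, -0.5, -0.4], [-0.6, -0.4, 1.0]]
    print(msvm_data_fit(F, cats=[0, 2], k=3))  # 0.1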


♣♣ Multicategory support vector machines (MSVMs) (cont.).

[Figure: probabilities $p_1(x)$, $p_2(x)$, $p_3(x)$ and the target $f_j$'s for a three-category SVM demonstration (Gaussian kernel), $x \in [0, 1]$.]

[Figure: four panels showing the estimated $(f_1, f_2, f_3)$ for the same demonstration under the tuning/classification schemes described below.]

The left panel above gives the estimated $f_1$, $f_2$ and $f_3$ when $\lambda$ and $\sigma$ were optimally tuned (i.e. with knowledge of the 'right' answer). In the second-from-left panel both $\lambda$ and $\sigma$ were chosen by 5-fold cross validation in the MSVM, and in the third panel they were chosen by GACV. In the rightmost panel the classification is carried out by a one-vs-rest method.


♣♣ Multicategory support vector machines (MSVMs) (cont.).

The nonstandard MSVM:

More generally, suppose the sample is not representative, and misclassification costs are not equal. Let

$$L_{jr} = (\pi_j / \pi_j^s)\, C_{jr}, \qquad j \neq r,$$

where $C_{jr}$ is the cost of misclassifying a $j$ as an $r$, $C_{rr} = 0$, $\pi_j$ is the prior probability of category $j$, and $\pi_j^s$ is the fraction of samples from category $j$ in the training set. Then the nonstandard MSVM has as its target the Bayes rule, which is to choose the $j$ which minimizes

$$\sum_{\ell=1}^{k} C_{\ell j}\, p_\ell(x).$$
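A sketch of the cost weights and the corresponding Bayes rule (the cost matrix, priors and sample fractions below are made-up illustrations):

    import numpy as np

    def nonstandard_weights(C, pi, pi_s):
        """L_jr = (pi_j / pi^s_j) * C_jr for j != r, and 0 on the diagonal."""
        L = (np.asarray(pi, float) / np.asarray(pi_s, float))[:, None] * np.asarray(C, float)
        np.fill_diagonal(L, 0.0)
        return L

    def bayes_rule(p, C):
        """Choose the j minimizing sum_l C_{lj} p_l(x)."""
        return int(np.argmin(np.asarray(C, float).T @ np.asarray(p, float)))

    # Toy example: misclassifying category 0 costs 4 times as much as other errors.
    C = np.array([[0, 4, 4], [1, 0, 1], [1, 1, 0]], float)
    print(nonstandard_weights(C, pi=[0.5, 0.3, 0.2], pi_s=[0.1, 0.3, 0.6]))
    print(bayes_rule([0.3, 0.4, 0.3], C))  # -> 0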


♣♣ Tuning the estimates.

GACV (generalized approximate cross validation). Penalized likelihood: [xiang.wahba.sinica.pdf], [lin.xiwuth.ps]; SVM: [nips97rr.ps, nips97rr.typos.ps]; MSVM: [lee.lee.pdf], [lee.lin.wahba.04.pdf].

Leaving out one:

$$V_O(\lambda) = \frac{1}{n} \sum_{i=1}^{n} C\big(y_i, f_\lambda^{[i]}(t_i)\big),$$

where $f_\lambda^{[i]}$ is the estimate without the $i$th data point.

$$GACV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} C\big(y_i, f_\lambda(t_i)\big) + D(y, f_\lambda),$$

where

$$D(y, f_\lambda) \approx \frac{1}{n} \sum_{i=1}^{n} \Big\{ C\big(y_i, f_\lambda^{[i]}(t_i)\big) - C\big(y_i, f_\lambda(t_i)\big) \Big\}$$

is obtained by a tailored perturbation argument. Easy to compute for the SVM; randomized trace techniques are used to estimate the perturbation in the likelihood case.
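For comparison, a brute-force sketch of the leave-out-one score $V_O(\lambda)$ that GACV approximates (fit_fn and loss_fn are placeholders for an actual fitting procedure and loss, not any particular library):

    def leave_out_one_score(X, Y, lam, fit_fn, loss_fn):
        """V_O(lambda) = (1/n) sum_i C(y_i, f^{[i]}_lambda(t_i)), refitting with the
        i-th point removed each time -- the quantity GACV approximates cheaply."""
        n = len(X)
        total = 0.0
        for i in range(n):
            keep = [j for j in range(n) if j != i]
            f_minus_i = fit_fn([X[j] for j in keep], [Y[j] for j in keep], lam)
            total += loss_fn(Y[i], f_minus_i(X[i]))
        return total / n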


♣♣ 8. The classification of upwelling MODIS radiance data to clear sky, water clouds or ice clouds.

From [lee.wahba.ackerman.04.pdf]. Classification of 12 channels of upwelling radiance data from the satellite-borne MODIS instrument. MODIS is a key part of the Earth Observing System (EOS).

Classify each vertical profile as coming from clear sky, water clouds, or ice clouds.

Next page: 744 simulated radiance profiles (81 clear: blue, 202 water clouds: green, 461 ice clouds: purple). 10 samples each from clear, from water and from ice:


[Figure: sample radiance profiles. Left panel: Reflectance vs. wavelength in microns (0.46-2.1); right panel: Brightness Temperature vs. wavelength in microns (6.6-12). Curves for clear, liquid cloud and ice cloud.]


[Figure: three pairwise scatter plots of the simulated profiles. Panel 1: BT channel 32 − BT channel 29 vs. BT channel 31; panel 2: R channel 1 / R channel 2 vs. R channel 2; panel 3: log10(R channel 5 / R channel 6) vs. R channel 2. Legend: ice clouds, water clouds, clear.]

Pairwise plots of three different variables (including composite variables). (Purple = ice clouds, green = water clouds, blue = clear.)


[Figure: MSVM classification boundaries in the (R channel 2, log10(R channel 5 / R channel 6)) plane.]

Classification boundaries on the 374-example test set determined by the MSVM using 370 training examples and two variables, one of them composite. Y. K. Lee: student poster prize, AMetSoc Satellite Meteorology and Oceanography session.


[Figure: nonstandard MSVM classification boundaries in the same (R channel 2, log10(R channel 5 / R channel 6)) plane.]

Classification boundaries determined by the nonstandard MSVM when the cost of misclassifying clear sky is 4 times higher than that of other types of misclassification.


[Figure: three pairwise scatter plots of the real profiles. Panel 1: BT channel 32 − BT channel 29 vs. BT channel 31; panel 2: R channel 1 / R channel 2 vs. R channel 2; panel 3: log10(R channel 5 / R channel 6) vs. R channel 2. Legend: ice clouds, water clouds, clear.]

Real Data: Pairwise plots of three different variables (including composite variables). (Purple = ice clouds, green = water clouds, blue = clear.) 1536 profiles "labeled by an expert." Note the remarkable similarity to the simulated data!


[Figure: MSVM classification boundaries for the real data in the (R channel 2, log10(R channel 5 / R channel 6)) plane.]

Real Data: Classification boundaries on the test set determined by the MSVM using training examples and two variables, one of them composite.


[Figure: five panels over the test-sample index. The first four panels plot the predicted decision-vector components $f_1$, $f_2$, $f_3$, $f_4$; the fifth panel plots the loss for each test sample.]

The first four panels show the predicted decision vectors $(f_1, f_2, f_3, f_4)$ at the test samples. The four class labels are coded as follows: EWS in blue: $(1, -1/3, -1/3, -1/3)$; BL in purple: $(-1/3, 1, -1/3, -1/3)$; NB in red: $(-1/3, -1/3, 1, -1/3)$;


and RMS in green: $(-1/3, -1/3, -1/3, 1)$. The colors indicate the true class identities of the test samples. We can see from the plot that all 20 test examples from the 4 classes are classified correctly and the estimated decision vectors are quite close to their ideal class representations. The fitted MSVM decision vectors for the 5 non-SRBCT samples are plotted in cyan. The last panel depicts the loss for the predicted decision vector at each test sample. The last 5 losses, corresponding to the predictions for the non-SRBCTs, all exceed the threshold (the dotted line) below which a prediction is considered strong. Three test samples falling into the known four classes cannot be classified confidently by the same threshold.


How strong is the classification?

A decision vector close to a class code in the multiclass case may mean a strong prediction. Recall that the multiclass hinge loss for an observation $y_i$ at $x_i$,

$$C(y_i, f(x_i)) = \sum_{r=1}^{k} L_{\mathrm{cat}(y_i)\,r}\,\big(f_r(x_i) - y_{ir}\big)_+,$$

measures the proximity between an MSVM decision vector and a coded class.

Cross validation heuristics will be used to estimate the strength of a prediction (standard case). For each $i$, leaving out the $i$th example, collect

$$C\big(\hat{y}_i, f^{[-i]}(x_i)\big) \equiv C^{[-i]},$$

where $\hat{y}_i$ is the coded version of the identification made by $f^{[-i]}$, along with an indicator as to whether $\hat{y}_i = y_i$, i.e. whether the correct identification was made by $f^{[-i]}$.
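A sketch of this proximity measure for a single observation in the standard case (illustrative names; 0-based category index):

    import numpy as np

    def hinge_proximity(y, f, cat):
        """C(y, f(x)) = sum_r L_{cat(y), r} ( f_r(x) - y_r )_+ with the standard
        weights (1 for r != cat(y), 0 for r = cat(y)).  Small values mean f(x)
        is close to the class code y."""
        terms = np.maximum(np.asarray(f, float) - np.asarray(y, float), 0.0)
        terms[cat] = 0.0  # L_{cat,cat} = 0
        return float(terms.sum())

    # A decision vector near the class code (1, -1/3, -1/3, -1/3) gives a small loss:
    print(hinge_proximity([1, -1/3, -1/3, -1/3], [0.95, -0.3, -0.35, -0.3], cat=0))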


Heuristically, with some symmetry assumptions, all of the $C^{[-i]}$ associated with a correct classification are then pooled, and all of the $C^{[-i]}$ associated with an incorrect classification are pooled, and these can be used to form an estimate of the probability of correct classification as a function of $C(y, f(x))$ for future observations. When the training set is completely correctly classified, the 95% point of the $C^{[-i]}$ distribution could be used.
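A rough sketch of this pooling heuristic (the windowing rule below is an illustrative choice, not from the references; the pooled losses would come from the leave-out-one fits just described):

    import numpy as np

    def prob_correct_given_loss(c, losses_correct, losses_incorrect):
        """Crude estimate of P(correct | C(y, f(x)) near c): among pooled
        leave-out-one losses within a simple window of c, the fraction that
        came from correct classifications."""
        lc = np.asarray(losses_correct, float)
        li = np.asarray(losses_incorrect, float)
        width = 0.25 * (np.std(np.concatenate([lc, li])) + 1e-12)  # simple bandwidth
        n_c = np.sum(np.abs(lc - c) <= width)
        n_i = np.sum(np.abs(li - c) <= width)
        return float(n_c) / max(n_c + n_i, 1)

    def strength_threshold(losses_correct):
        """When the training set is completely correctly classified, the 95% point
        of the pooled correct-classification losses could serve as the cutoff."""
        return float(np.percentile(losses_correct, 95))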
