

Feature-Cost Sensitive Learning with Submodular Trees of Classifiers

Matt J. Kusner, Wenlin Chen, Quan Zhou, Zhixiang (Eddie) Xu, Kilian Q. Weinberger, Yixin Chen
Department of Computer Science & Engineering, Washington University in St. Louis, USA
Department of Automation, Tsinghua University, Beijing, China

Budgeted Learning

Given a dataset D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n} \subset \mathbb{R}^d \times \mathcal{Y}, per-feature costs [c_1, \ldots, c_d]^\top, and a cost budget B, the goal is a ready, practical tool:

    \min_{\beta} \; \ell\big(\beta, \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}\big) \quad \text{s.t.} \quad c(\beta) \le B

solved with gradient-based optimization [Chen et al., 2012; Xu et al., 2013].
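One common relaxation of this constraint, and also the cost-weighted l1-classifier baseline compared against in Figure 3 below, penalizes each weight by its feature's cost. Here is a minimal sketch (my own, not code from the references), assuming squared loss; the function name and parameters are mine:

```python
# A minimal sketch of a cost-weighted l1 relaxation of the budgeted objective:
# penalizing |beta_a| by its feature cost c_a pushes expensive features out of
# the model, approximating the hard constraint c(beta) <= B.
import numpy as np

def cost_weighted_lasso(X, y, costs, lam=0.1, n_iters=500):
    """ISTA on 1/(2n) ||y - X beta||^2 + lam * sum_a c_a |beta_a|."""
    n, d = X.shape
    beta = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1/L, L = Lipschitz constant
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / n            # gradient of the squared loss
        z = beta - step * grad
        thresh = step * lam * costs                # per-feature threshold ~ c_a
        beta = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # soft-threshold
    return beta
```

Larger lam drives more (especially expensive) features to exactly zero, tracing out a cost/accuracy curve like the baseline in Figure 3.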

Cost-Sensitive Tree of Classifiers [Xu et al., 2014]

[Figure: a CSTC tree with nodes v^0, ..., v^6. Each internal node v^k holds a weight vector \beta^k and threshold \theta^k; an input x is routed to one child if \beta^{k\top} x > \theta^k and to the other otherwise, so different inputs reach different leaf classifiers \beta^3, ..., \beta^6.]


Expected Tree Loss

    E[\ell(T)] = \frac{1}{n} \sum_{v^k \in V} \sum_{i=1}^{n} p_i^k \big( y^{(i)} - \beta^{k\top} x^{(i)} \big)^2

Expected Tree Cost

    E[c(T)] = \sum_{v^l \in L} p^l \left[ \sum_a c_a \left\| \sum_{v^j \in \pi^l} |\beta_a^j| \right\|_0 \right]

where p_i^k is the probability that input x^{(i)} traverses node v^k, p^l is the probability of reaching leaf v^l, L is the set of leaves, and \pi^l is the set of nodes on the path from the root to v^l (so a leaf pays c_a once if feature a is used anywhere on its path).

Objective

    \min_{\beta^1, \ldots, \beta^{|V|}} \; E[\ell(T)] + \lambda E[c(T)]

+ state-of-the-art performance
- expensive global optimization
- requires continuous relaxation
- difficult to implement
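To make the two expectations concrete, here is a small numpy sketch (my notation, not the authors' code). It assumes p[k] holds the traversal probabilities p_i^k for node k, betas[k] its weight vector, paths[l] the node indices in \pi^l, and p_leaf[l] the leaf probability p^l:

```python
# Evaluating the expected tree loss and expected tree cost from above.
import numpy as np

def expected_tree_loss(X, y, betas, p):
    # E[l(T)] = 1/n sum_k sum_i p[k,i] * (y_i - beta_k^T x_i)^2
    n = X.shape[0]
    return sum((p[k] * (y - X @ betas[k]) ** 2).sum()
               for k in range(len(betas))) / n

def expected_tree_cost(costs, betas, paths, p_leaf):
    # E[c(T)]: each leaf pays c_a once per feature a used anywhere on its path
    total = 0.0
    for l, path in enumerate(paths):
        used = np.any([betas[k] != 0 for k in path], axis=0)  # the l0 term
        total += p_leaf[l] * (costs * used).sum()
    return total
```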

Approximately Submodular Tree of Classifiers

A New Objective

Instead of a single classifier optimized globally, treat per-node feature selection as set function optimization. Let \ell^k(A) be the best loss at node k using only the feature set A:

    \ell^k(A) \triangleq \min_{\beta^k} \frac{1}{n} \sum_{i=1}^{n} p_i^k \big( y^{(i)} - \beta^{k\top} x_A^{(i)} \big)^2

and pick the feature set that most reduces the loss within the budget:

    \max_A \; \ell^k(\emptyset) - \ell^k(A) \quad \text{s.t.} \quad c(A) \le B

Greedy is near-optimal: it achieves a (1 - e^{-\gamma})-approximation [Das & Kempe, 2011].
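A direct way to evaluate the set function \ell^k(A) is a weighted least-squares solve restricted to the columns in A; a minimal sketch (the names probs and feats are mine):

```python
# l^k(A): best weighted squared loss at node k using only the features in A.
import numpy as np

def node_loss(X, y, probs, feats):
    """l^k(A) = min_beta 1/n sum_i p_i^k (y_i - beta^T x_{i,A})^2."""
    n = X.shape[0]
    w = np.sqrt(probs)                 # weighting rows by sqrt(p_i^k) turns the
    Xa = X[:, feats] * w[:, None]      # weighted problem into ordinary least squares
    ya = y * w
    beta, *_ = np.linalg.lstsq(Xa, ya, rcond=None)
    return ((ya - Xa @ beta) ** 2).sum() / n
```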

Greedy Tree Building

Rather than optimizing the whole tree at once, ASTC builds it greedily, one node at a time.

[Figure: CSTC optimization vs. ASTC optimization. CSTC optimizes all node parameters (\beta^k, \theta^k) of the full tree v^0, ..., v^6 jointly; ASTC fits one node at a time with a closed-form solution per node, e.g. \beta^2 = (X^\top X)^{-1} X^\top y, followed by its threshold \theta^2.]

[Figure: the tree is grown incrementally: first the root v^0 with (\beta^0, \theta^0); then v^1 with (\beta^1, \theta^1); then v^2 with (\beta^2, \theta^2); then v^3, and so on.]

Greedy Feature Selection (node k)

At node k, repeatedly add the feature with the greatest loss reduction per cost [Grubb & Bagnell, 2012]:

    \operatorname*{argmax}_{a \in \{1, \ldots, d\}} \; \frac{\ell^k(A) - \ell^k(A \cup \{a\})}{c_a}
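A direct, unaccelerated sketch of this greedy rule, reusing node_loss from the sketch above; the budget check and the loss-reduction-per-cost score follow the argmax displayed above:

```python
# Greedy cost-aware feature selection for one node.
def select_features(X, y, probs, costs, budget):
    chosen, spent = [], 0.0
    cur = node_loss(X, y, probs, chosen)
    while True:
        best, best_gain = None, 0.0
        for a in range(X.shape[1]):
            if a in chosen or spent + costs[a] > budget:
                continue                              # already used or unaffordable
            gain = (cur - node_loss(X, y, probs, chosen + [a])) / costs[a]
            if gain > best_gain:
                best, best_gain = a, gain
        if best is None:                              # no affordable improvement left
            return chosen
        chosen.append(best)
        spent += costs[best]
        cur = node_loss(X, y, probs, chosen)
    # each iteration costs one least-squares solve per candidate; the QR
    # shortcut below removes exactly that inner-loop expense
```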

Fast Selection via QR-Decomposition

Each candidate's gain can be computed from a QR decomposition of the already-selected columns, X_A = QR:

    \frac{\ell^k(A) - \ell^k(A \cup \{a\})}{c_a} = \frac{(q^\top y)^2}{c_a}, \qquad \text{where} \quad q = \frac{X_a - QQ^\top X_a}{\|X_a - QQ^\top X_a\|_2}
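A numpy sketch of this shortcut (my notation): with a thin QR of the already-selected weighted columns, every candidate feature is scored at once with matrix products, avoiding one least-squares solve per candidate. The identity gives the unnormalized reduction; divide by n to match node_loss above.

```python
# Score all candidate features at once via the QR identity above.
import numpy as np

def qr_gains(X, y, probs, chosen, costs):
    w = np.sqrt(probs)[:, None]
    Q, _ = np.linalg.qr(X[:, chosen] * w)        # thin QR of selected columns
    Xw = X * w
    R = Xw - Q @ (Q.T @ Xw)                      # residuals of all columns at once
    norms = np.linalg.norm(R, axis=0)
    q_dot_y = (R / np.clip(norms, 1e-12, None)).T @ (y * np.sqrt(probs))
    return q_dot_y ** 2 / costs                  # loss reduction per cost, per feature
```

Already-selected columns get residual (and hence gain) near zero, so they are never re-picked.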

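Putting the pieces together, a high-level sketch of greedy tree building, reusing select_features from above. The hard median split and breadth-first order are my simplifications; I take the "ASTC, soft" variant in the results to use soft traversal probabilities instead.

```python
# Grow the tree node by node; each node fits a closed-form least-squares
# classifier on its greedily selected features, as in beta^2 = (X^T X)^{-1} X^T y.
import numpy as np

def grow_tree(X, y, costs, budget, n_nodes=3):
    nodes, frontier = [], [np.ones(len(y))]      # root sees every input
    for _ in range(n_nodes):
        probs = frontier.pop(0)                  # traversal probabilities p_i^k
        feats = select_features(X, y, probs, costs, budget)
        w = np.sqrt(probs)[:, None]
        beta, *_ = np.linalg.lstsq(X[:, feats] * w, y * np.sqrt(probs), rcond=None)
        preds = X[:, feats] @ beta
        theta = np.median(preds[probs > 0])      # one simple threshold choice
        nodes.append((feats, beta, theta))
        # children see the inputs their parent routes to them
        frontier += [probs * (preds > theta), probs * (preds <= theta)]
    return nodes
```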

Results

[Figure 3: Precision@5 vs. feature cost on Yahoo! (higher is better), and test error vs. feature cost on Forest, CIFAR, and MiniBooNE, comparing ASTC, ASTC (soft), CSTC, and a cost-weighted l1-classifier.]

Figure 3. Plot of ASTC, CSTC, and a cost-sensitive baseline on a real-world feature-cost sensitive dataset (Yahoo!) and three non-cost-sensitive datasets (Forest, CIFAR, MiniBooNE). For Yahoo!, circles mark the CSTC points used for the training-time comparison; otherwise, all points are compared.

Table 1. Training speed-up of ASTC over CSTC for different cost budgets on all datasets.

YAHOO!
cost budget:  10    52    86    169   468   800    1495
ASTC:         119x  52x   41x   21x   15x   9.2x   6.6x
ASTC, soft:   121x  48x   46x   18x   15x   8.2x   6.4x

FOREST
cost budget:  3     5     8     13    23    50
ASTC:         8.4x  7.0x  6.3x  4.9x  3.1x  1.4x
ASTC, soft:   8.0x  6.4x  5.7x  4.5x  2.8x  1.5x

CIFAR
cost budget:  9     24    76     180    239
ASTC:         5.6x  2.3x  0.68x  0.25x  0.14x
ASTC, soft:   5.3x  2.3x  0.62x  0.27x  0.13x

MINIBOONE
cost budget:  4     5     12    14    18    33    47
ASTC:         7.4x  7.9x  5.5x  5.2x  4.1x  3.1x  2.0x
ASTC, soft:   7.2x  6.2x  5.9x  4.2x  4.3x  2.5x  1.7x

References

Chen, M., Weinberger, K. Q., Chapelle, O., Kedem, D., Xu, Z. Classifier cascade for minimizing feature evaluation cost. AISTATS, 2012.
Xu, Z., Kusner, M. J., Weinberger, K. Q., Chen, M. Cost-sensitive tree of classifiers. ICML, 2013.
Xu, Z., Kusner, M. J., Weinberger, K. Q., Chen, M., Chapelle, O. Classifier cascades and trees for minimizing feature evaluation cost. JMLR, 2014.
Das, A., Kempe, D. Submodular meets spectral: greedy algorithms for subset selection, sparse approximation and dictionary selection. ICML, 2011.
Grubb, A., Bagnell, J. A. SpeedBoost: anytime prediction with uniform near-optimality. AISTATS, 2012.
