![Page 1: Model Selection via Bilevel Optimization Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli and Jong-Shi Pang Department of Mathematical Sciences](https://reader036.vdocument.in/reader036/viewer/2022062717/56649e195503460f94b06ca3/html5/thumbnails/1.jpg)
Model Selection via Bilevel Optimization
Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli and Jong-Shi Pang
Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY

Convex Machine Learning
Convex optimization approaches have been a major obsession of machine learning for the last ten years.
But are the problems really convex?

Outline
• The myth of convex machine learning
• Bilevel Programming Model Selection
  • Regression
  • Classification
• Extensions to other machine learning tasks
• Discussion

Modeler’s Choices
• Data: $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_\ell, y_\ell)$
• Function: $f(\mathbf{x}) = \mathbf{x}'\mathbf{w}$
• Loss/Regularization: $\min_{\mathbf{w}} \; C \sum_{i=1}^{\ell} \max(|\mathbf{x}_i'\mathbf{w} - y_i| - \varepsilon, 0) + \tfrac{1}{2}\|\mathbf{w}\|_2^2$
• Optimization Algorithm: returns $\hat{\mathbf{w}}$, giving $\hat{f}(\mathbf{x}) = \mathbf{x}'\hat{\mathbf{w}}$
CONVEX!

Many Hidden Choices
• Data:
  • Variable Selection
  • Scaling
  • Feature Construction
  • Missing Data
  • Outlier removal
• Function Family: linear, kernel (introduces kernel parameters)
• Optimization model
  • loss function
  • regularization
  • Parameters/Constraints
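
Cross-Validation
• Data: $[X, \mathbf{y}]$
• Function: $f(\mathbf{x}) = \mathbf{x}'\mathbf{w}$
• Loss/Regularization: $\min_{\mathbf{w}} \; C \sum_{i} \max(|\mathbf{x}_i'\mathbf{w} - y_i| - \varepsilon, 0) + \tfrac{1}{2}\|\mathbf{w}\|_2^2$
• Cross-Validation Strategy: split $[X, \mathbf{y}]$ into training parts $[X_{\mathrm{train}}, \mathbf{y}_{\mathrm{train}}]$ and test parts $[X_{\mathrm{test}}, \mathbf{y}_{\mathrm{test}}]$, with index sets $\overline{\Omega}_t$ and $\Omega_t$, $t = 1, \ldots, T$
• Generalization Error: $\frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} |\mathbf{x}_i'\mathbf{w}^t - y_i|$
• Cross-Validation: choose $C, \varepsilon$ given $[X, \mathbf{y}]$; the Optimization Algorithm then returns $\hat{\mathbf{w}}$, giving $\hat{f}(\mathbf{x}) = \mathbf{x}'\hat{\mathbf{w}}$
NONCONVEX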

How does modeler make choices?
• Best training set error
• Experience/policy
• Estimate of generalization error
  • Cross-validation
  • Bounds
• Optimize generalization error estimate
  • Fiddle around
  • Grid Search
  • Gradient methods
  • Bilevel Programming

Splitting Data for T-fold CV
• $\Omega_1, \ldots, \Omega_T$: $T$ disjoint partitions of $\Omega$; $\Omega_t$ is the $t$-th partition
• $\Omega_t$: validation set for the $t$-th fold
• $\overline{\Omega}_t = \Omega \setminus \Omega_t$: training set for the $t$-th fold
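
The fold bookkeeping on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names `t_fold_partitions` and `fold_train_validation`, the shuffling, and the seed are assumptions introduced here.

```python
import random


def t_fold_partitions(n_points, n_folds, seed=0):
    """Split indices 0..n_points-1 into n_folds disjoint validation sets."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    # Fold t takes every n_folds-th shuffled index, starting at position t.
    return [sorted(indices[t::n_folds]) for t in range(n_folds)]


def fold_train_validation(n_points, partitions, t):
    """Return (training indices, validation indices) for the t-th fold."""
    validation = set(partitions[t])
    training = [i for i in range(n_points) if i not in validation]
    return training, sorted(validation)


partitions = t_fold_partitions(n_points=10, n_folds=3)
train_0, valid_0 = fold_train_validation(10, partitions, 0)
```

Each index lands in exactly one validation set, and the training set of a fold is simply the complement of that fold's validation set.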

CV via Grid Search
For every C, ε
• For every validation set $\Omega_t$, solve the model on the corresponding training set $\overline{\Omega}_t$ and use the result to estimate the loss for $\Omega_t$
• Estimate generalization error for C, ε
Return best values for C, ε
Make final model using C, ε
[Figure: grid of candidate (C, ε) values, with axes C and ε]
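
The grid-search loop can be sketched as follows. To keep the example self-contained, the inner solver is a stand-in: a one-dimensional ridge regression with a single hyperparameter `penalty` replaces the SVR model with (C, ε). The data, the folds, and all function names are hypothetical.

```python
def fit_ridge_1d(points, penalty):
    """Stand-in inner solver: 1-D ridge regression through the origin."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + penalty)


def cv_error(data, folds, penalty):
    """Average over folds of the mean absolute deviation on the validation set."""
    total = 0.0
    for validation_idx in folds:
        held_out = set(validation_idx)
        training = [p for i, p in enumerate(data) if i not in held_out]
        w = fit_ridge_1d(training, penalty)
        fold_loss = sum(abs(data[i][0] * w - data[i][1]) for i in validation_idx)
        total += fold_loss / len(validation_idx)
    return total / len(folds)


def grid_search(data, folds, grid):
    """Pick the grid value with the lowest CV error, then refit on all data."""
    best = min(grid, key=lambda penalty: cv_error(data, folds, penalty))
    return best, cv_error(data, folds, best), fit_ridge_1d(data, best)


data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1), (6.0, 12.0)]
folds = [[0, 3], [1, 4], [2, 5]]
grid = [0.01, 1.0, 100.0]
best_penalty, best_error, final_w = grid_search(data, folds, grid)
```

The cost is one inner solve per grid point per fold, which is why the number of hyperparameters a grid can cover stays small.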

Bilevel Program for T folds
Prior Approaches: Golub et al., 1979, Generalized Cross-Validation for one parameter in Ridge Regression
CV as Continuous Optimization Problem
$$
\begin{aligned}
\min_{C,\,\varepsilon,\,\mathbf{w}^t} \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} |\mathbf{x}_i'\mathbf{w}^t - y_i| \\
\text{s.t.} \quad & \mathbf{w}^t \in \arg\min_{\mathbf{w}} \Big\{ C \sum_{j \in \overline{\Omega}_t} \max(|\mathbf{x}_j'\mathbf{w} - y_j| - \varepsilon, 0) + \tfrac{1}{2}\|\mathbf{w}\|_2^2 \Big\}, \quad t = 1, \ldots, T
\end{aligned}
$$
• Objective: outer-level validation problem
• Constraints: $T$ inner-level training problems
Here $\Omega_t$ is the validation set and $\overline{\Omega}_t$ the training set of the $t$-th fold.

Benefit: More Design Variables
Add feature box constraint $\underline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}}$ in the inner-level problems.
$$
\begin{aligned}
\min_{C,\,\varepsilon,\,\underline{\mathbf{w}},\,\overline{\mathbf{w}},\,\mathbf{w}^t} \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} |\mathbf{x}_i'\mathbf{w}^t - y_i| \\
\text{s.t.} \quad & \mathbf{w}^t \in \arg\min_{\underline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}}} \Big\{ C \sum_{j \in \overline{\Omega}_t} \max(|\mathbf{x}_j'\mathbf{w} - y_j| - \varepsilon, 0) + \tfrac{1}{2}\|\mathbf{w}\|_2^2 \Big\}
\end{aligned}
$$

ε-insensitive Loss Function
$$\max(|z| - \varepsilon, 0)$$
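
As a small illustration, the ε-insensitive loss and the inner-level objective built from it (C times the summed loss plus half the squared 2-norm of w) can be evaluated as below. The function names and the two sample points are assumptions made for this sketch.

```python
def eps_insensitive(z, eps):
    """Epsilon-insensitive loss: zero inside the tube |z| <= eps, linear outside."""
    return max(abs(z) - eps, 0.0)


def inner_objective(w, points, C, eps):
    """C * sum of eps-insensitive residual losses + 0.5 * ||w||_2^2 for f(x) = x'w."""
    loss = 0.0
    for x, y in points:
        prediction = sum(xi * wi for xi, wi in zip(x, w))
        loss += eps_insensitive(prediction - y, eps)
    return C * loss + 0.5 * sum(wi * wi for wi in w)


points = [([1.0, 0.0], 1.0), ([0.0, 2.0], 3.0)]
value = inner_objective([1.0, 1.0], points, C=10.0, eps=0.5)
```

Residuals smaller than ε in magnitude cost nothing, so only points outside the tube contribute to the loss term.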
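
Inner-level Problem for t-th Fold
$$
\min_{\underline{\mathbf{w}} \le \mathbf{w}^t \le \overline{\mathbf{w}}} \; C \sum_{j \in \overline{\Omega}_t} \max(|\mathbf{x}_j'\mathbf{w}^t - y_j| - \varepsilon, 0) + \tfrac{1}{2}\|\mathbf{w}^t\|_2^2
$$
is equivalent to
$$
\begin{aligned}
\min_{\mathbf{w}^t,\,\boldsymbol{\xi}^t} \quad & C \sum_{j \in \overline{\Omega}_t} \xi_j^t + \tfrac{1}{2}\|\mathbf{w}^t\|_2^2 \\
\text{s.t.} \quad & \mathbf{x}_j'\mathbf{w}^t - y_j \le \varepsilon + \xi_j^t \\
& y_j - \mathbf{x}_j'\mathbf{w}^t \le \varepsilon + \xi_j^t \\
& \xi_j^t \ge 0, \quad \forall j \in \overline{\Omega}_t \\
& \underline{\mathbf{w}} \le \mathbf{w}^t \le \overline{\mathbf{w}}
\end{aligned}
$$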
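
Optimality (KKT) conditions for fixed $C, \varepsilon, \underline{\mathbf{w}}, \overline{\mathbf{w}}$
$$
\begin{aligned}
& 0 \le \alpha_j^{t,+} \perp \mathbf{x}_j'\mathbf{w}^t - y_j + \varepsilon + \xi_j^t \ge 0 \\
& 0 \le \alpha_j^{t,-} \perp y_j - \mathbf{x}_j'\mathbf{w}^t + \varepsilon + \xi_j^t \ge 0 \\
& 0 \le \xi_j^t \perp C - \alpha_j^{t,+} - \alpha_j^{t,-} \ge 0 \qquad \forall j \in \overline{\Omega}_t \\
& 0 \le \boldsymbol{\gamma}^{t,+} \perp \overline{\mathbf{w}} - \mathbf{w}^t \ge 0 \\
& 0 \le \boldsymbol{\gamma}^{t,-} \perp \mathbf{w}^t - \underline{\mathbf{w}} \ge 0 \\
& 0 = \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} (\alpha_j^{t,+} - \alpha_j^{t,-})\, \mathbf{x}_j + \boldsymbol{\gamma}^{t,+} - \boldsymbol{\gamma}^{t,-}
\end{aligned}
$$
where $a \perp b$ means $a'b = 0$.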
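
As a sketch of where such conditions come from, one can write the Lagrangian of the slack-variable form of the inner problem for a single fold (fold index dropped). The multiplier names $\alpha_j^{\pm}$, $\mu_j$, $\boldsymbol{\gamma}^{\pm}$ and the sign convention are choices made for this derivation and may differ from the convention on the original slide.

```latex
\begin{aligned}
L ={}& C \sum_j \xi_j + \tfrac{1}{2}\|\mathbf{w}\|_2^2
  - \sum_j \alpha_j^{+} (\mathbf{x}_j'\mathbf{w} - y_j + \varepsilon + \xi_j)
  - \sum_j \alpha_j^{-} (y_j - \mathbf{x}_j'\mathbf{w} + \varepsilon + \xi_j) \\
 & - \sum_j \mu_j \xi_j
  - \boldsymbol{\gamma}^{+\prime} (\overline{\mathbf{w}} - \mathbf{w})
  - \boldsymbol{\gamma}^{-\prime} (\mathbf{w} - \underline{\mathbf{w}}) \\[4pt]
\nabla_{\mathbf{w}} L ={}& \mathbf{w} - \sum_j (\alpha_j^{+} - \alpha_j^{-})\, \mathbf{x}_j
  + \boldsymbol{\gamma}^{+} - \boldsymbol{\gamma}^{-} = 0 \\
\frac{\partial L}{\partial \xi_j} ={}& C - \alpha_j^{+} - \alpha_j^{-} - \mu_j = 0,
  \qquad \mu_j \ge 0, \quad \mu_j \xi_j = 0 \\
\Longrightarrow\ & 0 \le \xi_j \perp C - \alpha_j^{+} - \alpha_j^{-} \ge 0
\end{aligned}
```

Eliminating $\mu_j$ gives the complementarity condition on $\xi_j$; the remaining conditions pair each multiplier with the constraint it belongs to.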

Key Transformation
• KKT conditions for the inner-level training problems are necessary and sufficient
• Replace lower-level problems by their KKT conditions
• Problem becomes a Mathematical Program with Equilibrium Constraints (MPEC)
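
Bilevel Problem as MPEC
Replace the $T$ inner-level problems with the corresponding optimality conditions:
$$
\begin{aligned}
\min \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} |\mathbf{x}_i'\mathbf{w}^t - y_i| \\
\text{s.t.} \quad & \underline{\mathbf{w}} \le \overline{\mathbf{w}}, \text{ and for } t = 1, \ldots, T: \\
& 0 \le \alpha_j^{t,+} \perp \mathbf{x}_j'\mathbf{w}^t - y_j + \varepsilon + \xi_j^t \ge 0 \\
& 0 \le \alpha_j^{t,-} \perp y_j - \mathbf{x}_j'\mathbf{w}^t + \varepsilon + \xi_j^t \ge 0 \\
& 0 \le \xi_j^t \perp C - \alpha_j^{t,+} - \alpha_j^{t,-} \ge 0 \qquad \forall j \in \overline{\Omega}_t \\
& 0 \le \boldsymbol{\gamma}^{t,+} \perp \overline{\mathbf{w}} - \mathbf{w}^t \ge 0 \\
& 0 \le \boldsymbol{\gamma}^{t,-} \perp \mathbf{w}^t - \underline{\mathbf{w}} \ge 0 \\
& 0 = \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} (\alpha_j^{t,+} - \alpha_j^{t,-})\, \mathbf{x}_j + \boldsymbol{\gamma}^{t,+} - \boldsymbol{\gamma}^{t,-}
\end{aligned}
$$
The minimization is over $C, \varepsilon, \underline{\mathbf{w}}, \overline{\mathbf{w}}$ and the per-fold variables $\mathbf{w}^t, \boldsymbol{\xi}^t, \boldsymbol{\alpha}^{t,\pm}, \boldsymbol{\gamma}^{t,\pm}$.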

MPEC to NLP via Inexact Cross Validation
Relax “hard” equilibrium constraints
$$0 \le a \perp b \ge 0 \;\equiv\; a \ge 0,\ b \ge 0,\ a'b = 0$$
to “soft” inexact constraints
$$0 \le a \perp_{\mathrm{tol}} b \ge 0 \;\equiv\; a \ge 0,\ b \ge 0,\ a'b \le \mathrm{tol}$$
tol is some user-defined tolerance.
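
The relaxation can be illustrated with a small feasibility check. The function name `complementarity_holds` is an assumption for this sketch; with `tol=0.0` it tests the hard condition and with a positive `tol` the soft one.

```python
def complementarity_holds(a, b, tol=0.0):
    """Check a >= 0, b >= 0 and a'b <= tol (tol = 0 gives exact complementarity)."""
    nonnegative = all(ai >= 0 for ai in a) and all(bi >= 0 for bi in b)
    inner_product = sum(ai * bi for ai, bi in zip(a, b))
    return nonnegative and inner_product <= tol


exact = complementarity_holds([0.0, 2.0], [3.0, 0.0])
violated = complementarity_holds([0.01, 2.0], [3.0, 0.0])
relaxed = complementarity_holds([0.01, 2.0], [3.0, 0.0], tol=0.05)
```

A point that slightly violates exact complementarity can still be feasible for the relaxed nonlinear program, which is what lets a general-purpose NLP solver work on the problem.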

Solvers
• Strategy: Proof of concept using general-purpose nonlinear solvers from NEOS on NLP
  • FILTER, SNOPT
  • Sequential Quadratic Programming Methods
  • FILTER results almost always better
• Many possible alternatives:
  • Integer Programming
  • Branch and Bound
  • Lagrangian Relaxations

Computational Experiments: DATA
• Synthetic
  • (5,10,15)-D Data with Gaussian and Laplacian noise and (3,7,10) relevant features
  • NLP: 3-fold CV
  • Results: 30 to 90 train, 1000 test points, 10 trials
• QSAR/Drug Design
  • 4 datasets, 600+ dimensions reduced to 25 top principal components
  • NLP: 5-fold CV
  • Results: 40 to 100 train, rest test, 20 trials

Cross-validation Methods Compared
• Unconstrained Grid: Try 3 values each for C, ε
• Constrained Grid: Try 3 values each for C, ε, and {0, 1} for each component of $\overline{\mathbf{w}}$
• Bilevel/FILTER: Nonlinear program solved using off-the-shelf SQP algorithm via NEOS

15-D Data: Objective Value
[Bar chart: objective value for Unc Grid, Con Grid, and Filter at 15, 30, 60, and 90 pts]

15-D Data: Computational Time
[Bar chart: computational time for Unc Grid, Con Grid, and Filter at 15, 30, 60, and 90 pts]

15-D Data: TEST MAD
[Bar chart: test MAD for Unc Grid, Con Grid, and Filter at 15, 30, 60, and 90 pts]

QSAR Data: Objective Value
[Bar chart: objective value for Unc Grid, Con Grid, and Filter on Aquasol, BBB, Cancer, and CCK]

QSAR Data: Computation Time
[Bar chart: computation time for Unc Grid, Con Grid, and Filter on Aquasol, BBB, Cancer, and CCK]

QSAR Data: TEST MAD
[Bar chart: test MAD for Unc Grid, Con Grid, and Filter on Aquasol, BBB, Cancer, and CCK]

Classification Cross Validation
Given sample data $(\mathbf{x}_i, y_i),\ i = 1, \ldots, \ell$, from two classes, labeled $1$ and $-1$.
Find classification function that minimizes out-of-sample estimate of classification error.

Lower level - SVM
• Define parallel planes $\mathbf{x}'\mathbf{w} = 1$ and $\mathbf{x}'\mathbf{w} = -1$
• Minimize points on wrong side
• Maximize margin of separation $\dfrac{2}{\|\mathbf{w}\|}$

Lower Level Loss Function: Hinge Loss
$$\max(1 - z, 0)$$
Measures distance of points that violate the appropriate hyperplane constraints, with $z = y_i(\mathbf{w}'\mathbf{x}_i - b)$.

Lower Level Problem: SVC with box
$$
\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}},\; b \in \mathbb{R}} \; \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{j \in \overline{\Omega}_t} \max(1 - y_j(\mathbf{x}_j'\mathbf{w} - b), 0)
$$
is equivalent to
$$
\begin{aligned}
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad & \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{j \in \overline{\Omega}_t} \xi_j \\
\text{s.t.} \quad & y_j(\mathbf{x}_j'\mathbf{w} - b) \ge 1 - \xi_j \\
& \xi_j \ge 0, \quad \forall j \in \overline{\Omega}_t \\
& -\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}}
\end{aligned}
$$

Inner-level KKT Conditions
$$
\begin{aligned}
& 0 \le \alpha_j^t \perp y_j(\mathbf{x}_j'\mathbf{w}^t - b_t) - 1 + \xi_j^t \ge 0 \\
& 0 \le \xi_j^t \perp 1 - \alpha_j^t \ge 0 \qquad \forall j \in \overline{\Omega}_t \\
& 0 \le \boldsymbol{\gamma}^{t,+} \perp \overline{\mathbf{w}} - \mathbf{w}^t \ge 0 \\
& 0 \le \boldsymbol{\gamma}^{t,-} \perp \overline{\mathbf{w}} + \mathbf{w}^t \ge 0 \\
& 0 = \lambda \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t \mathbf{x}_j + \boldsymbol{\gamma}^{t,+} - \boldsymbol{\gamma}^{t,-} \\
& \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t = 0
\end{aligned}
$$

Outer-level Loss Functions
• Misclassification Minimization Loss (MM)
  • Loss function used in classical CV
  • Loss = 1 if validation pt misclassified, 0 otherwise (computed using step function, $(\cdot)_*$)
• Hinge Loss (HL)
  • Both inner and outer levels use same loss function
  • Loss = distance from the margin, measured through $y_i(\mathbf{w}'\mathbf{x}_i - b)$ (computed using max function, $\max(\cdot, 0)$)

Hinge Loss is Convex Approx. of Misclassification Minimization Loss
[Plot: hinge loss $\max(1 - z, 0)$ and step loss $(-z)_*$ as functions of $z$]
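
A quick numerical check of this relationship, assuming the step function is 1 for strictly positive arguments and 0 otherwise; the function names are illustrative.

```python
def hinge(z):
    """Hinge loss max(1 - z, 0), with z = y * (w'x - b)."""
    return max(1.0 - z, 0.0)


def step(a):
    """Step function: 1 if a > 0, else 0."""
    return 1.0 if a > 0 else 0.0


def misclassification(z):
    """0-1 loss: a point is misclassified when z is negative."""
    return step(-z)


grid = [k / 10.0 for k in range(-30, 31)]
dominated = all(hinge(z) >= misclassification(z) for z in grid)
```

On this grid the hinge loss is never below the misclassification loss, which is the sense in which it serves as a convex surrogate.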

Hinge Loss Bilevel Program (BilevelHL)
$$
\begin{aligned}
\min_{\lambda,\,\overline{\mathbf{w}},\,\mathbf{w}^t,\,b_t} \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} \max(1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t), 0) \\
\text{s.t.} \quad & (\mathbf{w}^t, b_t) \in \arg\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}},\; b \in \mathbb{R}} \Big\{ \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{j \in \overline{\Omega}_t} \max(1 - y_j(\mathbf{x}_j'\mathbf{w} - b), 0) \Big\}
\end{aligned}
$$
• Replace max in outer level objective with convex constraints:
$$
\max(1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t), 0) \;=\; \min_{z_i^t} \; z_i^t \quad \text{s.t.} \quad z_i^t \ge 1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t), \quad z_i^t \ge 0
$$
• Replace inner-level problems with KKT conditions
Here $\Omega_t$ is the validation set and $\overline{\Omega}_t$ the training set of the $t$-th fold.
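
Hinge Loss MPEC
$$
\begin{aligned}
\min \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} z_i^t \\
\text{s.t.} \quad & \lambda > 0, \quad \overline{\mathbf{w}} \ge 0, \text{ and for } t = 1, \ldots, T: \\
& z_i^t \ge 1 - y_i(\mathbf{x}_i'\mathbf{w}^t - b_t), \quad z_i^t \ge 0 \qquad \forall i \in \Omega_t \\
& 0 \le \alpha_j^t \perp y_j(\mathbf{x}_j'\mathbf{w}^t - b_t) - 1 + \xi_j^t \ge 0 \\
& 0 \le \xi_j^t \perp 1 - \alpha_j^t \ge 0 \qquad \forall j \in \overline{\Omega}_t \\
& 0 \le \boldsymbol{\gamma}^{t,+} \perp \overline{\mathbf{w}} - \mathbf{w}^t \ge 0 \\
& 0 \le \boldsymbol{\gamma}^{t,-} \perp \overline{\mathbf{w}} + \mathbf{w}^t \ge 0 \\
& 0 = \lambda \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t \mathbf{x}_j + \boldsymbol{\gamma}^{t,+} - \boldsymbol{\gamma}^{t,-} \\
& \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t = 0
\end{aligned}
$$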

Misclassification Min. Bilevel Program (BilevelMM)
$$
\begin{aligned}
\min_{\lambda,\,\overline{\mathbf{w}},\,\mathbf{w}^t,\,b_t} \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} \big( -y_i(\mathbf{x}_i'\mathbf{w}^t - b_t) \big)_* \\
\text{s.t.} \quad & (\mathbf{w}^t, b_t) \in \arg\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}},\; b \in \mathbb{R}} \Big\{ \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{j \in \overline{\Omega}_t} \max(1 - y_j(\mathbf{x}_j'\mathbf{w} - b), 0) \Big\}
\end{aligned}
$$
Misclassifications are counted using the step function, defined componentwise for an n-vector $\mathbf{a}$ as
$$
(\mathbf{a}_*)_n = \begin{cases} 1, & \text{if } a_n > 0, \\ 0, & \text{if } a_n \le 0. \end{cases}
$$

The Step Function
Mangasarian (1994) showed that
$$\mathbf{a}_* \in \arg\min_{\boldsymbol{\zeta}} \{ -\boldsymbol{\zeta}'\mathbf{a} : 0 \le \boldsymbol{\zeta} \le 1 \}$$
and that any solution, $(\mathbf{z}, \boldsymbol{\zeta})$, to the LP satisfies
$$
\begin{aligned}
& 0 \le \boldsymbol{\zeta} \perp -\mathbf{a} + \mathbf{z} \ge 0 \\
& 0 \le \mathbf{z} \perp 1 - \boldsymbol{\zeta} \ge 0
\end{aligned}
$$
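
The two complementarity conditions can be checked numerically for the step function. In this sketch the multiplier is taken as z = max(a, 0), which is an assumption made here (it is the multiplier of the bound ζ ≤ 1 in the LP); the function names are illustrative.

```python
def step_vector(a):
    """Componentwise step function: 1 where a_n > 0, else 0."""
    return [1.0 if an > 0 else 0.0 for an in a]


def lp_multiplier(a):
    """Candidate multiplier z = max(a, 0) for the upper bound zeta <= 1."""
    return [max(an, 0.0) for an in a]


def satisfies_lp_conditions(a, zeta, z, tol=1e-12):
    """Check 0 <= zeta ⊥ -a + z >= 0 and 0 <= z ⊥ 1 - zeta >= 0 componentwise."""
    for an, zeta_n, z_n in zip(a, zeta, z):
        slack_lower = -an + z_n   # pairs with zeta_n
        slack_upper = 1.0 - zeta_n  # pairs with z_n
        if min(zeta_n, slack_lower, z_n, slack_upper) < -tol:
            return False
        if abs(zeta_n * slack_lower) > tol or abs(z_n * slack_upper) > tol:
            return False
    return True


a = [2.0, -1.5, 0.0]
ok = satisfies_lp_conditions(a, step_vector(a), lp_multiplier(a))
wrong = satisfies_lp_conditions(a, [0.0, 1.0, 0.0], lp_multiplier(a))
```

The step vector passes the check, while a vector that flags the negative component as 1 fails it.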

Misclassifications in the Validation Set
Validation point misclassified when the sign of $y_i(\mathbf{x}_i'\mathbf{w}^t - b_t)$ is negative, i.e.,
$$\zeta_i^t = \big( -y_i(\mathbf{x}_i'\mathbf{w}^t - b_t) \big)_* > 0$$
This can be recast for all validation points (within the t-th fold) as
$$\boldsymbol{\zeta}^t \in \arg\min_{0 \le \boldsymbol{\zeta} \le 1} \sum_{i \in \Omega_t} \zeta_i\, y_i(\mathbf{x}_i'\mathbf{w}^t - b_t)$$

Misclassification Minimization Bilevel Program (revisited)
$$
\begin{aligned}
\min_{\lambda,\,\overline{\mathbf{w}},\,\mathbf{w}^t,\,b_t,\,\boldsymbol{\zeta}^t} \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} \zeta_i^t \\
\text{s.t.} \quad & \boldsymbol{\zeta}^t \in \arg\min_{0 \le \boldsymbol{\zeta} \le 1} \sum_{i \in \Omega_t} \zeta_i\, y_i(\mathbf{x}_i'\mathbf{w}^t - b_t) \\
& (\mathbf{w}^t, b_t) \in \arg\min_{-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}},\; b \in \mathbb{R}} \Big\{ \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{j \in \overline{\Omega}_t} \max(1 - y_j(\mathbf{x}_j'\mathbf{w} - b), 0) \Big\}
\end{aligned}
$$
• Objective: outer-level average misclassification minimization
• First constraint: inner-level problems to determine misclassified validation points
• Second constraint: inner-level training problems
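
Misclassification Minimization MPEC
$$
\begin{aligned}
\min \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} \zeta_i^t \\
\text{s.t.} \quad & \lambda > 0, \quad \overline{\mathbf{w}} \ge 0, \text{ and for } t = 1, \ldots, T: \\
& 0 \le \zeta_i^t \perp y_i(\mathbf{x}_i'\mathbf{w}^t - b_t) + z_i^t \ge 0 \\
& 0 \le z_i^t \perp 1 - \zeta_i^t \ge 0 \qquad \forall i \in \Omega_t \\
& 0 \le \alpha_j^t \perp y_j(\mathbf{x}_j'\mathbf{w}^t - b_t) - 1 + \xi_j^t \ge 0 \\
& 0 \le \xi_j^t \perp 1 - \alpha_j^t \ge 0 \qquad \forall j \in \overline{\Omega}_t \\
& 0 \le \boldsymbol{\gamma}^{t,+} \perp \overline{\mathbf{w}} - \mathbf{w}^t \ge 0 \\
& 0 \le \boldsymbol{\gamma}^{t,-} \perp \overline{\mathbf{w}} + \mathbf{w}^t \ge 0 \\
& 0 = \lambda \mathbf{w}^t - \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t \mathbf{x}_j + \boldsymbol{\gamma}^{t,+} - \boldsymbol{\gamma}^{t,-} \\
& \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t = 0
\end{aligned}
$$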

Inexact Cross Validation NLP
• Both BilevelHL and BilevelMM MPECs are transformed to NLP by relaxing equilibrium constraints (inexact CV)
• Solved using FILTER on NEOS
• These are compared with classical cross validation: unconstrained and constrained grid.

Experiments: Data sets
• 3-fold cross validation for model selection
• Average results for 20 train/test splits

Computational Time
[Bar chart: computational time for Unc-Grid, Con-Grd, Bilevel-MisMin, and Bilevel-Hinge on pima, cancer, heart, iono, bright, and dim]

Training CV Error
[Bar chart: training CV error for Unc-Grid, Con-Grd, Bilevel-MisMin, and Bilevel-Hinge on pima, cancer, heart, iono, bright, and dim]

Testing Error
[Bar chart: testing error for Unc-Grid, Con-Grd, Bilevel-MisMin, and Bilevel-Hinge on pima, cancer, heart, iono, bright, and dim]

Number of Variables
[Bar chart: number of variables for Unc-Grid, Con-Grd, Bilevel-MisMin, and Bilevel-Hinge on pima, cancer, heart, iono, bright, and dim]

Progress
• Cross Validation is a bilevel problem solvable by continuous optimization methods
• Off-the-shelf NLP algorithm (FILTER) solved classification and regression
• Bilevel Optimization extendable to many Machine Learning problems

Extending Bilevel Approach to other Machine Learning Problems
• Kernel Classification/Regression
• Variable Selection/Scaling
• Multi-task Learning
• Semi-supervised Learning
• Generative methods

Semi-supervised Learning
• Have $\ell$ labeled data, $\Omega = \{(\mathbf{x}_i, y_i)\}_{i=1}^{\ell}$, and $m$ unlabeled data, $\Psi = \{\mathbf{x}_u\}_{u=1}^{m}$
• Treat missing labels, $z_u$, as design variables in the outer level
• Lower level problems are still convex

Semi-supervised Regression
$$
\begin{aligned}
\min_{C,\,D,\,\varepsilon,\,\mathbf{z}} \quad & \sum_{i \in \Omega} |\mathbf{x}_i'\mathbf{w} - b - y_i| \\
\text{s.t.} \quad & (\mathbf{w}, b) \in \arg\min_{\mathbf{w},\, b} \Big\{ C \sum_{j \in \Omega} \max(|\mathbf{x}_j'\mathbf{w} - b - y_j| - \varepsilon, 0) + D \sum_{u \in \Psi} \max(|\mathbf{x}_u'\mathbf{w} - b - z_u| - \varepsilon, 0) + \tfrac{1}{2}\|\mathbf{w}\|_2^2 \Big\}
\end{aligned}
$$
• Outer level minimizes error on labeled data to find optimal parameters and labels
• First inner term: ε-insensitive loss on labeled data in inner level
• Second inner term: ε-insensitive loss on unlabeled data in inner level
• Third inner term: inner level regularization

Discussion
• New capacity offers new possibilities:
  • Outer level objectives?
  • Inner level problem? classification, ranking, semi-supervised, missing values, kernel selection, variable selection, …
• Need special purpose algorithms for greater efficiency, scalability, robustness
This work was supported by Office of Naval Research Grant N00014-06-1-0014.

Experiments: Bilevel CV Procedure
• Run BilevelMM/BilevelHL to compute optimal parameters, $\hat{\lambda}, \hat{\overline{\mathbf{w}}}$
• Drop descriptors with small $\hat{\overline{\mathbf{w}}}$
• Create model $(\hat{\mathbf{w}}, \hat{b})$ on all training data using $\hat{\lambda}, \hat{\overline{\mathbf{w}}}$
• Compute test error on hold-out set
$$
\mathrm{ERROR}_{test} = \frac{1}{\ell_{test}} \sum_{(\mathbf{x}, y) \in test} \frac{1}{2} \left| \operatorname{sign}(\hat{\mathbf{w}}'\mathbf{x} - \hat{b}) - y \right|
$$
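
The hold-out error is the fraction of misclassified test points, since half the absolute difference between a ±1 prediction and a ±1 label is 1 for a mistake and 0 otherwise. A minimal sketch, in which the function names, the sample points, and the convention sign(0) = +1 are assumptions made here:

```python
def sign(value):
    """Sign with the (assumed) convention sign(0) = +1."""
    return 1 if value >= 0 else -1


def holdout_error(w, b, points):
    """Average of 0.5 * |sign(w'x - b) - y| over hold-out points with y in {+1, -1}."""
    total = 0.0
    for x, y in points:
        score = sum(wi * xi for wi, xi in zip(w, x)) - b
        total += 0.5 * abs(sign(score) - y)
    return total / len(points)


holdout = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([0.0, 3.0], -1), ([3.0, 0.0], -1)]
error = holdout_error([1.0, -1.0], 0.0, holdout)
```

Here two of the four points fall on the wrong side of the hyperplane, so the error is one half.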

Experiments: Grid Search CV Procedure
• Unconstrained Grid: Try 6 values for $\lambda$ on a log10 scale
• Constrained Grid: Try 6 values for $\lambda$ and {0, 1} for each component of $\overline{\mathbf{w}}$ (perform RFE if necessary)
• Create model $(\hat{\mathbf{w}}, \hat{b})$ on all training data using optimal grid point $\hat{\lambda}, \hat{\overline{\mathbf{w}}}$
• Compute test error on hold-out set
$$
\mathrm{ERROR}_{test} = \frac{1}{\ell_{test}} \sum_{(\mathbf{x}, y) \in test} \frac{1}{2} \left| \operatorname{sign}(\hat{\mathbf{w}}'\mathbf{x} - \hat{b}) - y \right|
$$

Extending Bilevel Approach to other Machine Learning Problems
• Kernel Classification/Regression
• Different Regularizations (L1, elastic nets)
• Enhanced Feature Selection
• Multi-task Learning
• Semi-supervised Learning
• Generative methods

Enhanced Feature Selection
• Assume at most $n_{max}$ descriptors allowed
• Introduce outer-level constraint $\|\overline{\mathbf{w}}\|_0 \le n_{max}$ (with $\|\cdot\|_0$ counting the non-zero elements of $\overline{\mathbf{w}}$)
• Rewrite constraint, observing that $\|\overline{\mathbf{w}}\|_0 = \mathbf{e}'\overline{\mathbf{w}}_*$
• Get additional conditions for $\boldsymbol{\delta} = \overline{\mathbf{w}}_*$:
$$
\begin{aligned}
& 0 \le \boldsymbol{\delta} \perp -\overline{\mathbf{w}} + \mathbf{d} \ge 0 \\
& 0 \le \mathbf{d} \perp \mathbf{e} - \boldsymbol{\delta} \ge 0 \\
& \sum_{m=1}^{n} \delta_m \le n_{max}
\end{aligned}
$$

Kernel Bilevel Discussion
• Pros
  • Performs model selection in feature space
  • Performs feature selection in input space
• Cons
  • Highly nonlinear model
  • Difficult to solve
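
Kernel Classification (MPEC form)
$$
\begin{aligned}
\min \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} \zeta_i^t \\
\text{s.t.} \quad & C \ge 0, \quad \mathbf{p} \ge 0, \text{ and for } t = 1, \ldots, T: \\
& 0 \le \zeta_i^t \perp y_i \Big( \sum_{k \in \overline{\Omega}_t} y_k \alpha_k^t\, \kappa(\mathbf{x}_i, \mathbf{x}_k; \mathbf{p}) - b_t \Big) + z_i^t \ge 0 \\
& 0 \le z_i^t \perp 1 - \zeta_i^t \ge 0 \qquad \forall i \in \Omega_t \\
& 0 \le \alpha_j^t \perp y_j \Big( \sum_{k \in \overline{\Omega}_t} y_k \alpha_k^t\, \kappa(\mathbf{x}_j, \mathbf{x}_k; \mathbf{p}) - b_t \Big) - 1 + \xi_j^t \ge 0 \\
& 0 \le \xi_j^t \perp C - \alpha_j^t \ge 0 \qquad \forall j \in \overline{\Omega}_t \\
& \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t = 0
\end{aligned}
$$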

Is it okay to do 3 folds?

Applying the “kernel trick”
• Drop the box constraint, $-\overline{\mathbf{w}} \le \mathbf{w} \le \overline{\mathbf{w}}$
• Eliminate $\mathbf{w}$ from the optimality conditions
• Replace $\mathbf{x}_i'\mathbf{x}_k$ with an appropriate $\kappa(\mathbf{x}_i, \mathbf{x}_k)$
$$
\begin{aligned}
& 0 \le \zeta_i^t \perp y_i \Big( \sum_{k \in \overline{\Omega}_t} y_k \alpha_k^t\, \mathbf{x}_i'\mathbf{x}_k - b_t \Big) + z_i^t \ge 0 \\
& 0 \le z_i^t \perp 1 - \zeta_i^t \ge 0 \qquad \forall i \in \Omega_t \\
& 0 \le \alpha_j^t \perp y_j \Big( \sum_{k \in \overline{\Omega}_t} y_k \alpha_k^t\, \mathbf{x}_j'\mathbf{x}_k - b_t \Big) - 1 + \xi_j^t \ge 0 \\
& 0 \le \xi_j^t \perp C - \alpha_j^t \ge 0 \qquad \forall j \in \overline{\Omega}_t \\
& \sum_{j \in \overline{\Omega}_t} y_j \alpha_j^t = 0
\end{aligned}
$$

Feature Selection with Kernels
• Parameterize the kernel with $\mathbf{p} \in \mathbb{R}^n$, $\mathbf{p} \ge 0$, such that if $p_n = 0$ the n-th descriptor vanishes from $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p})$
• Linear kernel: $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p}) = \mathbf{x}' \operatorname{diag}(\mathbf{p})\, \mathbf{x}_k$
• Polynomial kernel: $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p}) = (\mathbf{x}' \operatorname{diag}(\mathbf{p})\, \mathbf{x}_k + c)^d$
• Gaussian kernel: $\kappa(\mathbf{x}, \mathbf{x}_k; \mathbf{p}) = \exp\!\big( -(\mathbf{x} - \mathbf{x}_k)' \operatorname{diag}(\mathbf{p})\, (\mathbf{x} - \mathbf{x}_k) \big)$
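
A short sketch of the three parameterized kernels, showing that a zero entry in p removes the corresponding descriptor. The function names and the default values of `c` and `d` are assumptions for illustration.

```python
import math


def linear_kernel(x, xk, p):
    """x' diag(p) xk."""
    return sum(pn * xn * xkn for pn, xn, xkn in zip(p, x, xk))


def polynomial_kernel(x, xk, p, c=1.0, d=2):
    """(x' diag(p) xk + c)^d."""
    return (linear_kernel(x, xk, p) + c) ** d


def gaussian_kernel(x, xk, p):
    """exp(-(x - xk)' diag(p) (x - xk))."""
    return math.exp(-sum(pn * (xn - xkn) ** 2 for pn, xn, xkn in zip(p, x, xk)))


p = [1.0, 0.0]                     # second descriptor switched off
x_a, x_b = [1.0, 5.0], [1.0, -7.0]  # differ only in the second descriptor
x_k = [2.0, 3.0]
same_linear = linear_kernel(x_a, x_k, p) == linear_kernel(x_b, x_k, p)
same_gaussian = gaussian_kernel(x_a, x_k, p) == gaussian_kernel(x_b, x_k, p)
```

Because the two inputs differ only in the descriptor whose weight is zero, every kernel value agrees between them.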
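
Kernel Regression (Bilevel form)
$$
\begin{aligned}
\min_{C,\,\varepsilon,\,\mathbf{p},\,\boldsymbol{\alpha}^{t,\pm}} \quad & \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|\Omega_t|} \sum_{i \in \Omega_t} \Big| \sum_{k \in \overline{\Omega}_t} (\alpha_k^{t,+} - \alpha_k^{t,-})\, \kappa(\mathbf{x}_i, \mathbf{x}_k; \mathbf{p}) - y_i \Big| \\
\text{s.t.} \quad & \boldsymbol{\alpha}^{t,\pm} \in \arg\min_{\boldsymbol{\alpha}^{+},\,\boldsymbol{\alpha}^{-}} \Big\{ C \sum_{j \in \overline{\Omega}_t} \max\Big( \Big| \sum_{k \in \overline{\Omega}_t} (\alpha_k^{+} - \alpha_k^{-})\, \kappa(\mathbf{x}_j, \mathbf{x}_k; \mathbf{p}) - y_j \Big| - \varepsilon, 0 \Big) \\
& \qquad\qquad + \frac{1}{2} \sum_{k \in \overline{\Omega}_t} \sum_{l \in \overline{\Omega}_t} (\alpha_k^{+} - \alpha_k^{-})(\alpha_l^{+} - \alpha_l^{-})\, \kappa(\mathbf{x}_l, \mathbf{x}_k; \mathbf{p}) \Big\}
\end{aligned}
$$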