
Page 1:

Large-Scale SVM Optimization: Taking a Machine Learning Perspective

Shai Shalev-Shwartz

Toyota Technological Institute at Chicago

Joint work with Nati Srebro

Talk at NEC Labs, Princeton, August, 2008

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 1 / 25

Page 2:

Motivation

10k training examples 1 hour 2.3% error

1M training examples 1 week 2.29% error

Can always sub-sample and get error of 2.3% using 1 hour

Can we leverage excess data to reduce runtime? Say, achieve error of 2.3% using 10 minutes?

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 2 / 25


Page 6:

Outline

Background: Machine Learning, Support Vector Machine (SVM)

SVM as an optimization problem

A Machine Learning Perspective on SVM Optimization

Approximated optimization
Re-define quality of optimization using generalization error
Error decomposition
Data-Laden Analysis

Stochastic Methods

Why Stochastic?
PEGASOS (Stochastic Gradient Descent)
Stochastic Dual Coordinate Ascent

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 3 / 25

Page 7:

Background: Machine Learning and SVM

Hypothesis set H
Loss function

Learning rule

Training set $(x_i, y_i)_{i=1}^m$ → Learning Algorithm → Output $h : \mathcal{X} \to \mathcal{Y}$

Support Vector Machine

Linear hypotheses: $h_w(x) = \langle w, x \rangle$
Prefer hypotheses with large margin, i.e., low Euclidean norm
Resulting learning rule:

$$\operatorname*{argmin}_{w} \;\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{m}\sum_{i=1}^{m} \underbrace{\max\{0,\; 1 - y_i \langle w, x_i \rangle\}}_{\text{hinge loss}}$$

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 4 / 25
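As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that evaluates this regularized hinge-loss objective; the data and the helper name svm_objective are hypothetical, introduced only for this example.

import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized hinge-loss objective: lam/2 * ||w||^2 + average hinge loss."""
    margins = y * (X @ w)                          # y_i <w, x_i> for each example
    hinge = np.maximum(0.0, 1.0 - margins)         # max{0, 1 - y_i <w, x_i>}
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# Tiny synthetic usage example (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.ones(5) + 0.1 * rng.normal(size=100))
print(svm_objective(np.zeros(5), X, y, lam=0.1))   # w = 0 gives objective 1.0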

Page 8:

Support Vector Machines and Optimization

SVM learning rule:

$$\operatorname*{argmin}_{w} \;\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{m}\sum_{i=1}^{m} \max\{0,\; 1 - y_i \langle w, x_i \rangle\}$$

The SVM optimization problem can be written as a Quadratic Programming problem:

$$\operatorname*{argmin}_{w,\xi} \;\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{m}\sum_{i=1}^{m} \xi_i \qquad \text{s.t.} \;\; \forall i,\; 1 - y_i \langle w, x_i \rangle \le \xi_i \;\wedge\; \xi_i \ge 0$$

Standard solvers exist. End of story?

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 5 / 25
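For concreteness, a minimal sketch of this QP using the cvxpy modeling library (an assumption made for illustration; cvxpy and a QP-capable solver must be installed, and the data below is synthetic):

import cvxpy as cp
import numpy as np

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
m, d, lam = 200, 10, 0.1
X = rng.normal(size=(m, d))
y = np.sign(rng.normal(size=m))

w = cp.Variable(d)
xi = cp.Variable(m)
objective = cp.Minimize(0.5 * lam * cp.sum_squares(w) + cp.sum(xi) / m)
constraints = [1 - cp.multiply(y, X @ w) <= xi, xi >= 0]
prob = cp.Problem(objective, constraints)
prob.solve()
print("primal objective value:", prob.value)

A generic solver like this handles the QP exactly, but its runtime grows quickly with m, which is what motivates the rest of the talk.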

Page 9:

Approximated Optimization

If we don’t have infinite computation power, we can only approximately solve the SVM optimization problem

Traditional analysis
SVM objective:

$$P(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \ell(\langle w, x_i \rangle, y_i)$$

$w$ is a $\rho$-accurate solution if

$$P(w) \le \min_{w'} P(w') + \rho$$

Main focus: how does the optimization runtime depend on $\rho$? E.g., interior-point (IP) methods converge in time $O(m^{3.5} \log\log(1/\rho))$

Large-scale problems: how does the optimization runtime depend on $m$? E.g., SMO converges in time $O(m^2 \log(1/\rho))$; SVM-Perf runtime is $O\!\left(\frac{m}{\lambda\rho}\right)$

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 6 / 25

Page 10:

Machine Learning Perspective on Optimization

Our real goal is not to solve the SVM problem $P(w)$
Our goal is to find $w$ with low generalization error:

$$L(w) = \mathbb{E}_{(x,y)}\left[\ell(\langle w, x \rangle, y)\right]$$

where the expectation is over the underlying data distribution

Redefine approximate accuracy:

$w$ is an $\epsilon$-accurate solution w.r.t. margin parameter $B$ if

$$L(w) \le \min_{w' : \|w'\| \le B} L(w') + \epsilon$$

Study runtime as a function of $\epsilon$ and $B$

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 7 / 25

Page 11:

Error Decomposition

Theorem (S, Srebro ’08)

If $w$ satisfies

$$P(w) \le \min_{w'} P(w') + \rho$$

then, w.p. at least $1 - \delta$ over the choice of the training set, $w$ satisfies

$$L(w) \le \min_{w' : \|w'\| \le B} L(w') + \epsilon$$

with

$$\epsilon = \frac{\lambda B^2}{2} + \frac{c \log(1/\delta)}{\lambda m} + 2\rho$$

(Following: Bottou and Bousquet, “The Tradeoffs of Large Scale Learning”, NIPS ’08)

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 8 / 25
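A side note, not on the slides: balancing the first two terms of this bound over $\lambda$ gives

$$\lambda^\star = \sqrt{\frac{2c\log(1/\delta)}{m B^2}} \qquad\Longrightarrow\qquad \epsilon \;\le\; B\sqrt{\frac{2c\log(1/\delta)}{m}} \;+\; 2\rho,$$

so as $m$ grows the estimation part of the bound shrinks and the achievable excess error is governed mainly by the optimization accuracy $\rho$.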

Page 12:

More Data ⇒ Less Work?

[Figure: decomposition of $L(w)$ into approximation, estimation, and optimization error, as a function of training set size $m$.]

When the data set size increases:
Can increase $\rho$ ⇒ can optimize less accurately ⇒ runtime decreases
But handling more data may be expensive ⇒ runtime increases

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 9 / 25


Page 15:

Machine Learning Analysis of Optimization Algorithms

Given a solver with optimization accuracy $\rho(T, m, \lambda)$
To ensure excess generalization error $\le \epsilon$ we need that

$$\min_{\lambda} \;\; \frac{\lambda B^2}{2} + \frac{c \log(1/\delta)}{\lambda m} + 2\rho(T, m, \lambda) \;\le\; \epsilon$$

From the above we get the runtime $T$ as a function of $m$, $B$, $\epsilon$

Examples (ignoring logarithmic terms and constants, and assuming linear kernels):

                              ρ(T, m, λ)      T(m, B, ε)
SMO (Platt ’98)               exp(−T/m²)      (B/ε)⁴
SVM-Perf (Joachims ’06)       m/(λT)          (B/ε)⁴
SGD (S, Srebro, Singer ’07)   1/(λT)          (B/ε)²

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 10 / 25
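A brief sketch, not spelled out on the slide, of how the SGD row follows: plug $\rho(T, m, \lambda) = \frac{1}{\lambda T}$ into the error decomposition and give each term a third of the budget $\epsilon$:

$$\frac{\lambda B^2}{2} \le \frac{\epsilon}{3} \;\Rightarrow\; \lambda = \frac{2\epsilon}{3B^2}, \qquad \frac{2}{\lambda T} \le \frac{\epsilon}{3} \;\Rightarrow\; T \ge \frac{6}{\lambda \epsilon} = \frac{9B^2}{\epsilon^2},$$

while $\frac{c\log(1/\delta)}{\lambda m} \le \frac{\epsilon}{3}$ fixes the required sample size $m = \Omega(B^2/\epsilon^2)$; hence $T(m, B, \epsilon) = O\!\left((B/\epsilon)^2\right)$ up to constants and logarithmic factors.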


Page 17:

Stochastic Gradient Descent (Pegasos)

Initialize $w_1 = 0$
For $t = 1, 2, \ldots, T$:
Choose $i \in [m]$ uniformly at random
Define

$$\nabla_t = \lambda w_t - \mathbb{1}[y_i \langle w_t, x_i \rangle < 1]\, y_i x_i$$

Note: $\mathbb{E}[\nabla_t]$ is a sub-gradient of $P(w)$ at $w_t$

Set $\eta_t = \frac{1}{\lambda t}$

Update:

$$w_{t+1} = w_t - \eta_t \nabla_t = \left(1 - \tfrac{1}{t}\right) w_t + \frac{1}{\lambda t}\, \mathbb{1}[y_i \langle w_t, x_i \rangle < 1]\, y_i x_i$$

Theorem (Pegasos Convergence)

$$\mathbb{E}[\rho] \le O\!\left(\frac{\log(T)}{\lambda T}\right)$$

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 11 / 25
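A minimal NumPy sketch of this update rule (the function name pegasos, the data interface, and the fixed iteration budget are assumptions for illustration; optional Pegasos features such as projection onto a ball or mini-batches are omitted). Here X is an (m, d) array of examples and y a vector of ±1 labels.

import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Plain stochastic sub-gradient descent on the regularized hinge loss."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                    # choose i in [m] uniformly at random
        eta = 1.0 / (lam * t)                  # step size eta_t = 1 / (lambda t)
        active = y[i] * (X[i] @ w) < 1.0       # hinge loss active at the current w_t
        w = (1.0 - 1.0 / t) * w                # shrinkage from the regularization term
        if active:
            w = w + eta * y[i] * X[i]          # data term of the sub-gradient step
    return w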

Page 18:

Dependence on Data Set Size

Corollary (Pegasos generalization analysis)

$$T(m; \epsilon, B) = O\!\left(\frac{1}{\left(\frac{\epsilon}{B} - \frac{1}{\sqrt{m}}\right)^2}\right)$$

[Figures: theoretical runtime vs. training set size, decreasing from the sample complexity towards the data-laden regime; and empirical results on CCAT: millions of iterations (∝ runtime) vs. training set size, 300,000–700,000 examples.]

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 12 / 25

Page 19:

Intermediate Summary

Analyze runtime ($T$) as a function of:
excess generalization error ($\epsilon$)
size of the competing class ($B$)

Up to constants and logarithmic terms, stochastic gradient descent (Pegasos) is optimal – its runtime is of the order of the sample complexity, $\Omega((B/\epsilon)^2)$

For Pegasos, running time decreases as the training set size increases

Coming next

Limitations of Pegasos
Dual Coordinate Ascent methods

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 13 / 25

Page 20:

Limitations of Pegasos

Pegasos is a simple and efficient optimization method. However, it has some limitations:

log(sample complexity) factor in convergence rate

No clear stopping criterion

Tricky to obtain a good single solution with high confidence

Too aggressive at the beginning (especially when $\lambda$ is very small)

When working with kernels, too many support vectors

Hsieh et al. recently argued that, empirically, dual coordinate ascent outperforms Pegasos

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 14 / 25

Page 21:

Dual Methods

The dual SVM problem:

$$\max_{\alpha \in [0,1]^m} D(\alpha) \qquad \text{where} \qquad D(\alpha) = \frac{1}{m}\sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2\lambda m^2}\Big\|\sum_{i} \alpha_i y_i x_i\Big\|^2$$

Decomposition Methods

Dual problem has a different variable for each example

⇒ can optimize over subset of variables at each iteration

Extreme case

Dual Coordinate Ascent (DCA) – optimize $D$ w.r.t. a single variable at each iteration
SMO – optimize over 2 variables (necessary when there is a bias term)

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 15 / 25

Page 22:

Linear convergence for decomposition methods

The general convergence theory of Luo and Tseng (’92) implies linear convergence

But the dependence on $m$ is quadratic. Therefore

$$T = O(m^2 \log(1/\rho))$$

This implies the Machine Learning analysis

$$T = O(B^4/\epsilon^4)$$

Why is SGD so much better than decomposition methods?

Primal vs. dual?
Stochastic?

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 16 / 25

Page 23:

Stochastic Dual Coordinate Ascent

The stochastic DCA algorithm

Initialize $\alpha = (0, \ldots, 0)$ and $w = 0$
For $t = 1, 2, \ldots, T$:
Choose $i \in [m]$ uniformly at random
Update $\alpha_i \leftarrow \alpha_i + \tau_i$ where

$$\tau_i = \max\left\{-\alpha_i,\; \min\left\{1 - \alpha_i,\; \frac{\lambda m\,(1 - y_i \langle w, x_i \rangle)}{\|x_i\|^2}\right\}\right\}$$

Update $w \leftarrow w + \frac{\tau_i}{\lambda m}\, y_i x_i$

Hsieh et al. showed encouraging empirical results

No satisfactory theoretical guarantee

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 17 / 25
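A minimal NumPy sketch of this update (the function name sdca and the data interface are assumptions made for illustration); it maintains w = (1/(λm)) Σ_i α_i y_i x_i in sync with α, as in the pseudocode above:

import numpy as np

def sdca(X, y, lam, T, seed=0):
    """Stochastic dual coordinate ascent for the SVM dual, one coordinate per step."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    alpha = np.zeros(m)
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)                                  # pick a dual coordinate
        step = lam * m * (1.0 - y[i] * (X[i] @ w)) / (X[i] @ X[i])
        tau = max(-alpha[i], min(1.0 - alpha[i], step))      # keep alpha_i inside [0, 1]
        alpha[i] += tau
        w += (tau / (lam * m)) * y[i] * X[i]                 # keep w consistent with alpha
    return w, alpha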

Page 24:

Analysis of stochastic DCA

Theorem (S ’08)

With probability at least $1 - \delta$, the accuracy of stochastic DCA satisfies

$$\rho \;\le\; \frac{8 \ln(1/\delta)}{T}\left(\frac{1}{\lambda} + m\right)$$

Proof idea:

Let $\alpha^\star$ be the optimal dual solution

Upper bound the dual sub-optimality at round $t$ by the double potential

$$\frac{1}{2\lambda m}\, \mathbb{E}_i\!\left[\|\alpha^t - \alpha^\star\|^2 - \|\alpha^{t+1} - \alpha^\star\|^2\right] \;+\; \mathbb{E}_i\!\left[D(\alpha^{t+1}) - D(\alpha^t)\right]$$

Sum over $t$, use telescoping, and bound the result using weak duality

Use approximated duality theory (Scovel, Hush, Steinwart ’08)

Finally, use measure concentration techniques

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 18 / 25


Page 26:

Comparing SGD and DCA

SGD: $\rho(m, T, \lambda) \;\le\; \dfrac{\log(T)}{\lambda T}$

DCA: $\rho(m, T, \lambda) \;\le\; \dfrac{1}{T}\left(\dfrac{1}{\lambda} + m\right)$

Conclusion: relative performance depends on whether $\lambda m \overset{?}{<} \log(T)$

[Figures: accuracy $\epsilon_{\text{acc}}$ vs. $\lambda$ (log scales; $\lambda \in [10^{-8}, 10^{-2}]$, $\epsilon_{\text{acc}} \in [10^{-6}, 10^{2}]$) for SGD and DCA on the CCAT and cov1 datasets.]

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 19 / 25
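To make the comparison concrete, a tiny sketch that evaluates the two bounds above for some made-up values of λ, m, and T (the numbers are purely illustrative):

import numpy as np

lam, m, T = 1e-5, 800_000, 10_000_000    # hypothetical regularization, data size, iterations

sgd_bound = np.log(T) / (lam * T)        # SGD:  rho <= log(T) / (lambda * T)
dca_bound = (1.0 / lam + m) / T          # DCA:  rho <= (1/lambda + m) / T

# Up to an additive constant, the comparison reduces to lambda*m vs. log(T), as stated above.
print("lambda*m =", lam * m, "  log(T) =", round(np.log(T), 2))
print("smaller bound:", "SGD" if sgd_bound < dca_bound else "DCA")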

Page 27:

Combining SGD and DCA ?

The above graphs raise a natural question: can we somehow combine SGD and DCA?

Seemingly, this is impossible, as SGD is a primal algorithm while DCA is a dual algorithm

Interestingly, SGD can also be viewed as a dual algorithm, but with a dual function that changes along the optimization process

This is an ongoing direction ...

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 20 / 25

Page 28:

Machine Learning analysis of DCA

So far, we compared SGD and DCA using the old way (ρ)

But, what about runtime as a function of ε and B?

Similarly to the previous derivation (and ignoring log terms):

SGD: $T \le \dfrac{B^2}{\epsilon^2}$

DCA: $T \le \dfrac{B^2}{\epsilon^3}$

Is this really the case?

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 21 / 25


Page 30:

SGD vs. DCA – Machine Learning Perspective

[Figures: hinge loss and 0–1 loss vs. runtime in epochs (log scale, $10^0$–$10^2$) for SGD and DCA on the CCAT and cov1 datasets.]

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 22 / 25

Page 31:

Analysis of DCA revisited

DCA analysis: $T \le \dfrac{1}{\lambda\rho} + \dfrac{m}{\rho}$

The first term is as in SGD, while the second term involves the training set size. This is necessary since each dual variable has only a $1/m$ effect on $w$.

However, a more delicate analysis is possible:

Theorem (DCA refined analysis)

If $T \ge m$ then, with high probability, at least one of the following holds true:

After a single epoch, DCA satisfies $L(w) \le \min_{w' : \|w'\| \le B} L(w')$

DCA achieves accuracy

$$\rho \;\le\; \frac{c}{T - m}\left(\frac{1}{\lambda} + \lambda m B^2 + B\sqrt{m}\right)$$

The above theorem implies $T \le O(B^2/\epsilon^2)$.

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 23 / 25

Page 32:

Discussion

Bottou and Bousquet initiated a study of approximated optimization from the perspective of generalization error

We further develop this idea

Regularized loss (like SVM)
Comparing algorithms based on runtime for achieving a certain generalization error
Comparing algorithms in the data-laden regime
More data ⇒ less work

Two stochastic approaches are close to optimal

Best methods are extremely simple :-)

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 24 / 25

Page 33:

Limitations and Open Problems

The analysis is based on upper bounds on the estimation and optimization errors

The online-to-batch analysis gives the same bounds for one epoch over the data (no theoretical explanation of when we need more than one pass)

We assume constant runtime for each inner-product evaluation (holds for linear kernels). How to deal with non-linear kernels?

Sampling?
Smart selection (online learning on a budget? Clustering?)

We assume $\lambda$ is optimally chosen. Incorporating the runtime of tuning $\lambda$ into the analysis?

Assumptions on the distribution (e.g., noise conditions)? ⇒ Better analysis

A more general theory of optimization from a machine learning perspective

Shai Shalev-Shwartz (TTI-C) SVM from ML Perspective Aug’08 25 / 25