The Rate of Convergence of AdaBoost
Indraneel Mukherjee
Cynthia Rudin
Rob Schapire
AdaBoost (Freund and Schapire 97)
Basic properties of AdaBoost’s convergence are still not fully understood.
We address one of these basic properties: convergence rates with no assumptions.
• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier
• AdaBoost iteratively minimizes “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999)
Examples: {(x_i, y_i)}_{i=1,...,m}, with each (x_i, y_i) ∈ X × {−1, 1}
Hypotheses: H = {h_1, ..., h_N}, where h_j : X → [−1, 1]
Combination: F(x) = λ_1 h_1(x) + … + λ_N h_N(x)
Misclassification error ≤ exponential loss:
(1/m) ∑_{i=1}^m 1[y_i F(x_i) ≤ 0] ≤ (1/m) ∑_{i=1}^m exp(−y_i F(x_i))
Exponential loss:
L(λ) = (1/m) ∑_{i=1}^m exp(−∑_{j=1}^N λ_j y_i h_j(x_i))
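The misclassification bound above is just the pointwise inequality 1[z ≤ 0] ≤ exp(−z) applied to z = y_i F(x_i); a minimal NumPy sketch with made-up data (all sizes and values hypothetical) checks it numerically:

```python
import numpy as np

# Toy setup matching the slides: m examples, N weak hypotheses.
# H[i, j] = h_j(x_i) in [-1, 1]; y[i] in {-1, +1}. All data is made up.
rng = np.random.default_rng(0)
m, N = 20, 5
H = rng.choice([-1.0, 1.0], size=(m, N))
y = rng.choice([-1, 1], size=m)
lam = rng.normal(size=N)               # combination weights lambda_1..lambda_N

F = H @ lam                            # F(x_i) = sum_j lambda_j h_j(x_i)
margins = y * F

misclass = np.mean(margins <= 0)       # (1/m) sum 1[y_i F(x_i) <= 0]
exp_loss = np.mean(np.exp(-margins))   # L(lambda) = (1/m) sum exp(-y_i F(x_i))

# 1[z <= 0] <= exp(-z) pointwise, so the bound holds for every lambda.
assert misclass <= exp_loss
```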
Known:
• AdaBoost converges asymptotically to the minimum of the exponential loss (Collins et al 2002, Zhang and Yu 2005)
• Convergence rates under strong assumptions:
• “weak learning” assumption holds, hypotheses are better than random guessing (Freund and Schapire 1997, Schapire and Singer 1999)
• assume that a finite minimizer exists (Rätsch et al 2002, many classic results)
• Conjectured by Schapire (2010) that fast convergence rates hold without any assumptions.
• Convergence rate is relevant for consistency of AdaBoost (Bartlett and Traskin 2007).
Outline
• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”
• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”
Main Messages
• Usual approaches assume a finite minimizer; it is much more challenging not to assume this!
• We separate two different modes of analysis: comparison to a reference solution and comparison to an optimal solution. Different rates of convergence are possible in each.
• Analyses of convergence rates often ignore the “constants”; we show they can be extremely large in the worst case.
Based on a conjecture that says...
“At iteration t, L(λ_t) will be at most ε more than the loss of any parameter vector of ℓ1-norm bounded by B, in a number of rounds that is at most a polynomial in log N, m, B, and 1/ε.”
[Figure: a ball of radius B containing the reference vector λ*; the iterate λ_t reaches loss L(λ_t) ≤ L(λ*) + ε]
This happens at:
t ≤ poly(log N, m, B, 1/ε)
Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁⁶ ε⁻⁵ rounds.
This matches the conjectured poly(log N, m, B, 1/ε) form.
The best previously known result is that it takes at most order e^(1/ε²) rounds (Bickel et al).
Intuition behind proof of Theorem 1
• Old fact: if AdaBoost takes a large step, it makes a lot of progress:
L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²)
δ_t is called the “edge.” It is related to the step size.
[Figure: the ball of radius B containing λ*; the iterate λ_t sits at distance S_t from the sublevel set of L(λ*), with loss gap R_t]
R_t := ln L(λ_t) − ln L(λ*)   (measures progress)
S_t := inf { ‖λ − λ_t‖₁ : L(λ) ≤ L(λ*) }   (measures distance)
• First lemma says: if S_t is small, then δ_t is large.
• Second lemma says: S_t remains small (unless R_t is already small).
• Combining: the δ_t's are large at each t (unless R_t is already small); specifically, δ_t ≥ R_{t−1}³ / B³ in each round t.
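The “old fact” can be checked directly: for {−1, +1}-valued hypotheses, one AdaBoost round with step size α_t = ½ ln((1+δ_t)/(1−δ_t)) multiplies the exponential loss by exactly √(1 − δ_t²). A minimal sketch on made-up data (not the paper's construction):

```python
import numpy as np

# One AdaBoost round, illustrating L(lambda_t) = L(lambda_{t-1}) * sqrt(1 - delta_t^2)
# for {-1, +1}-valued hypotheses. All data below is made up.
rng = np.random.default_rng(1)
m, N = 30, 8
H = rng.choice([-1.0, 1.0], size=(m, N))
y = rng.choice([-1, 1], size=m)

lam = np.zeros(N)
loss = lambda l: np.mean(np.exp(-y * (H @ l)))   # exponential loss L(lambda)

L_prev = loss(lam)
w = np.exp(-y * (H @ lam))
w /= w.sum()                                     # AdaBoost's distribution over examples

edges = (w * y) @ H                              # edge of each weak hypothesis
j = np.argmax(np.abs(edges))                     # pick the hypothesis with the largest edge
delta = edges[j]                                 # the "edge" delta_t

alpha = 0.5 * np.log((1 + delta) / (1 - delta))  # AdaBoost's step size
lam[j] += alpha

# The multiplicative progress guarantee holds (with equality for +/-1 hypotheses):
assert np.isclose(loss(lam), L_prev * np.sqrt(1 - delta**2))
```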
Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁⁶ ε⁻⁵ rounds.
• Dependence on ‖λ*‖₁ is necessary for many datasets.
Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least (roughly) the norm of the smallest solution achieving loss L*:
inf { ‖λ‖₁ : L(λ) ≤ L* } / (2 ln m)
Lemma: There are simple datasets for which the norm of the smallest solution achieving loss L* is exponential in the number of examples:
inf { ‖λ‖₁ : L(λ) ≤ 2^(−m) + ε } ≥ (2^(m−2) − 1) ln(1/(3ε))
Conjecture: AdaBoost achieves loss at most L(λ*) + ε in at most O(B²/ε) rounds.
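For a sense of scale, the proven bound 13B⁶ε⁻⁵ and the conjectured O(B²/ε) rate can be compared numerically; the constant hidden in the O(·) is taken to be 1 here, purely for illustration:

```python
# Comparing Theorem 1's bound 13 * B**6 / eps**5 (with B = ||lambda*||_1)
# against the conjectured O(B**2 / eps) rate. The O(.) constant is
# arbitrarily set to 1; this is only an illustration of the gap.
for B in (1.0, 10.0):
    for eps in (0.1, 0.01):
        proven = 13 * B**6 / eps**5
        conjectured = B**2 / eps
        print(f"B={B}, eps={eps}: proven <= {proven:.3g}, conjectured ~ {conjectured:.3g}")
```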
[Figure: “Rate on a Simple Dataset” (log scale); Loss − (Optimal Loss) vs. Number of rounds]
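The behavior in the figure can be reproduced in spirit with a small AdaBoost loop; this sketch uses random made-up data (not the paper's dataset) and only verifies that the exponential loss decreases monotonically, round after round:

```python
import numpy as np

# A full AdaBoost loop, tracking the exponential loss per round.
# Dataset and sizes are made up; this is only a qualitative illustration.
rng = np.random.default_rng(2)
m, N, T = 50, 10, 200
H = rng.choice([-1.0, 1.0], size=(m, N))
y = rng.choice([-1, 1], size=m)

lam = np.zeros(N)
losses = []
for t in range(T):
    w = np.exp(-y * (H @ lam))
    w /= w.sum()                                    # distribution over examples
    edges = (w * y) @ H
    j = np.argmax(np.abs(edges))
    delta = np.clip(edges[j], -0.999999, 0.999999)  # guard against |edge| = 1
    lam[j] += 0.5 * np.log((1 + delta) / (1 - delta))
    losses.append(np.mean(np.exp(-y * (H @ lam))))

# Each round multiplies the loss by sqrt(1 - delta_t^2) <= 1, so it never increases.
assert all(b <= a + 1e-12 for a, b in zip(losses, losses[1:]))
```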
Outline
• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”
• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”
Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C depends only on the data.
• Better dependence on ε than Theorem 1; in fact, optimal.
• Doesn't depend on the size of the best solution within a ball.
• Can't be used to prove the conjecture, because in some cases C > 2^m. (Mostly it's much smaller.)
• Main tool is the “decomposition lemma”: it says that the examples fall into two categories, the zero-loss set Z and the finite-margin set F.
• A similar approach was taken independently by (Telgarsky, 2011).
[Figure: training examples, labeled + and −, partitioned into the finite-margin set F and the zero-loss set Z]
Decomposition Lemma
For any dataset, there exists a partition of the training examples into Z and F such that the following hold simultaneously:
1.) For some γ > 0, there exists a vector η⁺ with ‖η⁺‖₁ = 1 such that:
∀ i ∈ Z: ∑_j η⁺_j y_i h_j(x_i) ≥ γ   (margins are at least γ on Z)
∀ i ∈ F: ∑_j η⁺_j y_i h_j(x_i) = 0   (examples in F have zero margins)
2.) The optimal loss considering only the examples in F is achieved by some finite η*.
[Figure: the examples in Z are separated with margin γ by η⁺; the examples in F admit the finite minimizer η*]
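A tiny hand-built example (entirely hypothetical, not from the paper) makes the two conditions concrete: here h_1 certifies margin γ on Z and vanishes on F, while the loss restricted to F has a finite minimizer:

```python
import numpy as np

# Hand-built toy instance of the decomposition lemma (all values made up).
# Hypotheses take values in [-1, 1]; h_1 is zero on the finite-margin set F.
H = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [0.0,  1.0],
              [0.0, -1.0]])   # H[i, j] = h_j(x_i)
y = np.array([1, 1, 1, 1])

Z = [0, 1]                    # zero-loss set
F = [2, 3]                    # finite-margin set

eta_plus = np.array([1.0, 0.0])              # ||eta_plus||_1 = 1
margins = y * (H @ eta_plus)

gamma = 1.0
assert all(margins[i] >= gamma for i in Z)   # condition 1: margins >= gamma on Z
assert all(margins[i] == 0 for i in F)       # condition 1: zero margins on F

# Condition 2: restricted to F only h_2 is active, and the loss
# exp(-lam2) + exp(+lam2) is minimized at the finite point lam2 = 0.
lam2 = np.linspace(-3, 3, 601)
loss_F = np.exp(-lam2) + np.exp(lam2)
assert np.argmin(loss_F) == 300              # index 300 corresponds to lam2 = 0
```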
• We also provide a conjecture about the dependence on m.
Lemma: There are simple datasets for which the constant C is doubly exponential: at least 2^(Ω(2^m/m)).
Conjecture: If the hypotheses are {−1, 0, 1}-valued, AdaBoost converges to within ε of the optimal loss within 2^(O(m ln m)) ε^(−1+o(1)) rounds.
• This would give optimal dependence on m and ε simultaneously.
To summarize
• Two rate bounds. One depends on the size of the best solution within a ball and has dependence ε⁻⁵.
• The other, C/ε, depends only on ε, but C can be doubly exponential in m.
• Many lower bounds and conjectures in the paper.
Thank you
Intuition behind proof of Theorem 2
• Old Fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²). If the δ_t's are large, we make progress.
• First lemma says: the δ_t's are large whenever the loss on Z is large.
• Second lemma says: the δ_t's are large whenever the loss on F is large. This translates into the δ_t's being large whenever the loss on Z is small.