The Rate of Convergence of AdaBoost
Indraneel Mukherjee
Cynthia Rudin
Rob Schapire
AdaBoost (Freund and Schapire 1997)
Basic properties of AdaBoost’s convergence are still not fully understood.
We address one of these basic properties: convergence rates with no assumptions.
• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier.
• AdaBoost iteratively minimizes the “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999).
Examples: {(x_i, y_i)}_{i=1,…,m}, with each (x_i, y_i) ∈ X × {−1, +1}
Hypotheses: H = {h_1, …, h_N}, where h_j : X → [−1, 1]
Combination: F(x) = λ_1 h_1(x) + … + λ_N h_N(x)
Misclassification error ≤ exponential loss:
  (1/m) Σ_{i=1}^{m} 1[ y_i F(x_i) ≤ 0 ]  ≤  (1/m) Σ_{i=1}^{m} exp( −y_i F(x_i) )
Exponential loss:
  L(λ) = (1/m) Σ_{i=1}^{m} exp( −Σ_{j=1}^{N} λ_j y_i h_j(x_i) )
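To make these two quantities concrete, here is a minimal numeric sketch (ours, not from the talk), assuming a tiny synthetic dataset and simple sign-based hypotheses; the names `exp_loss`, `misclass_error`, and the data are illustrative only. It evaluates L(λ) for an arbitrary λ and checks that the misclassification error never exceeds it.

```python
# Minimal sketch (illustrative, not the talk's code): empirical exponential loss
# L(lambda) and the misclassification error it upper-bounds, on toy data.
import numpy as np

rng = np.random.default_rng(0)
m, N = 20, 5
X = rng.normal(size=(m, N))         # m examples with N features
y = rng.choice([-1, 1], size=m)     # labels in {-1, +1}

# Hypotheses h_j : X -> [-1, 1]; here, coordinate-sign "stumps" (an assumption).
H = np.sign(X)                      # H[i, j] = h_j(x_i)

def F(lam):
    """Combined score F(x_i) = sum_j lambda_j * h_j(x_i), for all i at once."""
    return H @ lam

def exp_loss(lam):
    """L(lambda) = (1/m) * sum_i exp(-y_i * F(x_i))."""
    return np.mean(np.exp(-y * F(lam)))

def misclass_error(lam):
    """(1/m) * sum_i 1[y_i * F(x_i) <= 0]."""
    return np.mean(y * F(lam) <= 0)

lam = rng.normal(size=N)            # an arbitrary combination vector
print(misclass_error(lam), "<=", exp_loss(lam))
assert misclass_error(lam) <= exp_loss(lam)   # holds since 1[z <= 0] <= exp(-z)
```

The inequality holds pointwise because 1[z ≤ 0] ≤ exp(−z) for every real z; averaging over the examples gives the displayed bound.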
[Figure: the exponential loss L(λ) plotted over the (λ_1, λ_2) plane.]
Known:
• AdaBoost converges asymptotically to the minimum of the exponential loss (Collins et al. 2002, Zhang and Yu 2005).
• Convergence rates are known under strong assumptions:
  • the “weak learning” assumption holds, i.e., the hypotheses are better than random guessing (Freund and Schapire 1997, Schapire and Singer 1999);
  • a finite minimizer exists (Rätsch et al. 2002, many classic results).
• Schapire (2010) conjectured that fast convergence rates hold without any assumptions.
• The convergence rate is relevant for the consistency of AdaBoost (Bartlett and Traskin 2007).
Outline
• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”
• Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”
Main Messages
• Usual approaches assume a finite minimizer; it is much more challenging not to assume this.
• We separate two different modes of analysis (comparison to a reference solution, comparison to the optimal solution); different rates of convergence are possible in each.
• Analyses of convergence rates often ignore the “constants”; we show they can be extremely large in the worst case.
Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”
Based on a conjecture that says...
"At iteration t, L(λ t) will be at most Ú more than that of any parameter vector of l 1-norm bounded by B
in a number of rounds that is at most a polynomial
inlogN,m, B, and 1/Ú."
[Figure: a ball of radius B in parameter space containing the reference solution λ*; AdaBoost's iterate λ^t reaches loss L(λ^t) ≤ L(λ*) + ε.]
This happens at:
  t ≤ poly( log N, m, B, 1/ε )
Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁⁶ ε⁻⁵ rounds.
This is poly( log N, m, B, 1/ε ).
The best previously known result is that it takes at most on the order of e^{1/ε²} rounds (Bickel et al.).
Intuition behind proof of Theorem 1
• Old fact: if AdaBoost takes a large step, it makes a lot of progress:
  L(λ^t) ≤ L(λ^{t−1}) √(1 − δ_t²)
• δ_t is called the “edge.” It is related to the step size.
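The bound above can be checked numerically. The sketch below (our illustration on an assumed toy dataset with ±1-valued hypotheses) runs a few rounds of AdaBoost as coordinate descent on the exponential loss: it reweights the examples, picks the hypothesis with the largest edge δ_t, takes the usual step α_t = ½ ln((1+δ_t)/(1−δ_t)), and asserts the per-round bound; for ±1-valued hypotheses the bound holds with equality.

```python
# Minimal sketch (illustrative): one AdaBoost round as a coordinate-descent step
# on the exponential loss, checking the per-round progress bound.
import numpy as np

rng = np.random.default_rng(1)
m, N = 30, 8
y = rng.choice([-1, 1], size=m)                        # labels
H = rng.choice([-1, 1], size=(m, N)).astype(float)     # H[i, j] = h_j(x_i), +/-1-valued

def exp_loss(lam):
    return np.mean(np.exp(-y * (H @ lam)))

def adaboost_round(lam):
    w = np.exp(-y * (H @ lam))
    w /= w.sum()                           # AdaBoost's distribution over examples
    edges = (w * y) @ H                    # edge of hypothesis j: sum_i w_i y_i h_j(x_i)
    j = int(np.argmax(np.abs(edges)))      # best weak hypothesis this round
    delta = edges[j]
    alpha = 0.5 * np.log((1 + delta) / (1 - delta))    # AdaBoost's step size
    new_lam = lam.copy()
    new_lam[j] += alpha
    return new_lam, abs(delta)

lam = np.zeros(N)
for t in range(1, 6):
    prev = exp_loss(lam)
    lam, delta = adaboost_round(lam)
    assert exp_loss(lam) <= prev * np.sqrt(1 - delta**2) + 1e-12   # the "old fact"
    print(f"round {t}: edge = {delta:.3f}, loss = {exp_loss(lam):.4f}")
```

A larger edge δ_t forces a larger step α_t and a smaller factor √(1 − δ_t²), which is the sense in which large steps mean fast progress.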
[Figure: the ball of radius B with the reference λ* and the current iterate λ^t.]
R_t := ln L(λ^t) − ln L(λ*)   (measures progress)
S_t := inf_λ { ‖λ − λ^t‖₁ : L(λ) ≤ L(λ*) }   (measures distance)
Intuition behind proof of Theorem 1
• Old fact: L(λ^t) ≤ L(λ^{t−1}) √(1 − δ_t²). If the δ_t's are large, we make progress.
• First lemma: if S_t is small, then δ_t is large.
• Second lemma: S_t remains small (unless R_t is already small).
• Combining: the δ_t's are large at each t (unless R_t is already small); specifically, δ_t ≥ R_{t−1}³ / B³ in each round t.
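As a rough sketch (our reading of how the pieces could combine, not the paper's proof; constants are dropped), the old fact plus the bound δ_t ≥ R_{t−1}³/B³ give a recurrence for R_t whose solution already has the B⁶ ε⁻⁵ shape of Theorem 1, writing B = ‖λ*‖₁:

```latex
% Rough sketch only; constants dropped.  R_t = ln L(\lambda^t) - ln L(\lambda^*),  B = \|\lambda^*\|_1.
\begin{align*}
L(\lambda^t) \le L(\lambda^{t-1})\sqrt{1-\delta_t^2}
  \;&\Longrightarrow\; R_{t-1} - R_t \ge -\tfrac{1}{2}\ln\bigl(1-\delta_t^2\bigr) \ge \tfrac{1}{2}\,\delta_t^2 ,\\
\delta_t \ge R_{t-1}^{3}/B^{3}
  \;&\Longrightarrow\; R_{t-1} - R_t \ge \frac{R_{t-1}^{6}}{2B^{6}} .
\end{align*}
% Treating the recurrence as dR/dt <= -R^6 / (2 B^6) and integrating:
\[
  t \;\lesssim\; \frac{2B^{6}}{5}\Bigl(\frac{1}{R_t^{5}} - \frac{1}{R_0^{5}}\Bigr),
  \qquad\text{so } R_t \le \varepsilon \text{ after } O\bigl(B^{6}\varepsilon^{-5}\bigr) \text{ rounds.}
\]
```

Since L(λ⁰) = 1, we may assume L(λ*) ≤ 1 (otherwise the claim is immediate at round 0), and then R_t ≤ ε implies L(λ^t) ≤ L(λ*) e^ε ≤ L(λ*) + 2ε for ε ≤ 1, so driving R_t below ε suffices up to constants.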
• Dependence on ‖λ*‖₁ is necessary for many datasets.
Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least (roughly) the norm of the smallest solution achieving loss L*; precisely, at least
  inf { ‖λ‖₁ : L(λ) ≤ L* } / (2 ln m)
Lemma: There are simple datasets for which the norm of the smallest solution achieving loss L* is exponential in the number of examples.
Lemma: There are simple datasets for which
  inf { ‖λ‖₁ : L(λ) ≤ 2/m + ε } ≥ (2^{m−2} − 1) ln( 1/(3ε) )
Conjecture: AdaBoost achieves loss at most L(λ*) + ε in at most O(B²/ε) rounds.
[Figure: “Rate on a Simple Dataset” (log scale). x-axis: number of rounds, 10 to 1e+05; y-axis: loss minus optimal loss, 3e−06 to 3e−02.]
Convergence Rate 2: Convergence to optimal loss. “Can we get within ε of an optimal solution?”
Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C only depends on the data.
• Better dependence on ε than Theorem 1; in fact, optimal.
• Does not depend on the size of the best solution within a ball.
• Cannot be used to prove the conjecture, because in some cases C > 2^m. (Usually it is much smaller.)
• The main tool is the “decomposition lemma,” which says that the examples fall into two categories: the zero-loss set Z and the finite-margin set F.
• A similar approach was taken independently by Telgarsky (2011).
[Figure: a toy dataset of positive and negative examples; the zero-loss set Z and the finite-margin set F are highlighted.]
Decomposition Lemma: For any dataset, there exists a partition of the training examples into Z and F such that both of the following hold simultaneously:
1.) For some γ > 0, there exists a vector η⁺ with ‖η⁺‖₁ = 1 such that
  ∀ i ∈ Z:  Σ_j η_j⁺ y_i h_j(x_i) ≥ γ   (margins are at least γ on Z)
  ∀ i ∈ F:  Σ_j η_j⁺ y_i h_j(x_i) = 0   (examples in F have zero margin)
2.) The optimal loss considering only the examples in F is achieved by some finite η*.
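For intuition, here is a tiny hand-built check (our toy construction, not the paper's): the dataset is written directly through the matrix A[i, j] = y_i h_j(x_i); a unit-ℓ₁ vector η⁺ gives margin at least γ = 0.5 on the first three examples (which form Z) and margin exactly 0 on the last two (which form F), and a finite η* attains the optimal loss restricted to F.

```python
# Minimal sketch (our hand-built example): checking the two conditions of the
# decomposition lemma on a dataset where the partition into Z and F is known.
import numpy as np

# A[i, j] = y_i * h_j(x_i).  Rows 0-2 will form Z, rows 3-4 will form F.
A = np.array([
    [ 1.0,  1.0,  0.0],   # Z
    [ 1.0,  0.0,  1.0],   # Z
    [ 0.0,  1.0,  1.0],   # Z
    [ 1.0, -1.0,  0.0],   # F
    [-1.0,  1.0,  0.0],   # F
])
eta_plus = np.array([0.5, 0.5, 0.0])        # ||eta_plus||_1 = 1
gamma = 0.5

margins = A @ eta_plus                       # margin of each example along eta_plus
Z = margins >= gamma                         # condition 1a: margin >= gamma on Z
F = np.isclose(margins, 0.0)                 # condition 1b: margin exactly 0 on F
print("margins:", margins)                   # [1.0, 0.5, 0.5, 0.0, 0.0]
assert np.all(Z | F) and not np.any(Z & F)   # every example is in exactly one set

# Condition 2: restricted to F the rows are mirror images, so the exponential
# loss on F is cosh(eta_1 - eta_2) >= 1, minimized by the *finite* eta* = 0.
eta_star = np.zeros(3)
print("optimal loss on F:", np.mean(np.exp(-(A[F] @ eta_star))))   # 1.0
```

On Z, scaling up a direction like η⁺ drives the loss of those examples to zero, while F behaves like a problem with a finite minimizer; the proof of Theorem 2 handles the two parts separately.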
[Figure: the direction η⁺ gives every example in the zero-loss set a margin of at least γ.]
[Figure: restricted to the finite-margin set F, the optimal loss is attained by a finite η*.]
• We also provide a conjecture about the dependence on m.
Lemma: There are simple datasets for which the constant C is doubly exponential: at least 2^{Ω(2^m / m)}.
Conjecture: If the hypotheses are {−1, 0, +1}-valued, AdaBoost converges to within ε of the optimal loss within 2^{O(m ln m)} ε^{−(1+o(1))} rounds.
• This would give optimal dependence on m and ε simultaneously.
To summarize
• Two rate bounds: one depends on the size of the best solution within a ball and has dependence ε⁻⁵.
• The other is C/ε, but C can be doubly exponential in m.
• Many lower bounds and conjectures in the paper.
Thank you
Intuition behind proof of Theorem 2
• Old fact: L(λ^t) ≤ L(λ^{t−1}) √(1 − δ_t²). If the δ_t's are large, we make progress.
• First lemma: the δ_t's are large whenever the loss on Z is large.
• Second lemma: the δ_t's are large whenever the loss on F is large; this translates into the δ_t's being large whenever the loss on Z is small.
• see notes