The Boosting Approach to Machine Learning Maria-Florina Balcan 10/01/2018


Page 1:

The Boosting Approach to

Machine Learning

Maria-Florina Balcan

10/01/2018

Page 2:

Boosting

β€’ Works by creating a series of challenge datasets s.t. even modest performance on these can be used to produce an overall high-accuracy predictor.

β€’ Backed up by solid foundations.

β€’ Works well in practice (Adaboost and its variations are among the top 10 algorithms).

β€’ General method for improving the accuracy of any given learning algorithm.

Page 3:

Readings:

β€’ The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001

β€’ Theory and Applications of Boosting. NIPS tutorial.

http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf

Plan for today:

β€’ Motivation.

β€’ A bit of history.

β€’ Adaboost: algo, guarantees, discussion.

β€’ Focus on supervised classification.

Page 4:

An Example: Spam Detection

E.g., classify which emails are spam and which are important.

Key observation/motivation:

β€’ Easy to find rules of thumb that are often correct.

β€’ E.g., β€œIf buy now in the message, then predict spam.”

β€’ E.g., β€œIf say good-bye to debt in the message, then predict spam.”

β€’ Harder to find a single rule that is very highly accurate.

Page 5:

An Example: Spam Detection

β€’ Boosting: meta-procedure that takes in an algo for finding rules of thumb (weak learner). Produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets.

β€’ Apply the weak learner to a subset of emails, obtain a rule of thumb h_1.

β€’ Apply to a 2nd subset of emails, obtain a 2nd rule of thumb h_2.

β€’ Apply to a 3rd subset of emails, obtain a 3rd rule of thumb h_3.

β€’ Repeat T times; combine the weak rules h_1, …, h_T into a single highly accurate rule.

[Figure: emails split into subsets, each producing a rule of thumb h_1, h_2, h_3, …, h_T.]

Page 6:

Boosting: Important Aspects

How to choose examples on each round?

β€’ Typically, concentrate on the β€œhardest” examples (those most often misclassified by previous rules of thumb).

How to combine rules of thumb into a single prediction rule?

β€’ Take a (weighted) majority vote of the rules of thumb.

Page 7:

Historically….

Page 8:

Weak Learning vs Strong/PAC Learning

β€’ [Kearns & Valiant ’88]: defined weak learning: being able to predict better than random guessing (error ≀ 1/2 βˆ’ Ξ³), consistently.

β€’ Posed an open problem: β€œDoes there exist a boosting algo that turns a weak learner into a strong PAC learner (that can produce arbitrarily accurate hypotheses)?”

β€’ Informally, given a β€œweak” learning algo that can consistently find classifiers of error ≀ 1/2 βˆ’ Ξ³, a boosting algo would provably construct a single classifier with error ≀ Ο΅.

Page 9:

Weak Learning vs Strong/PAC Learning

Strong (PAC) Learning:
β€’ βˆƒ algo A
β€’ βˆ€ c ∈ H
β€’ βˆ€ D
β€’ βˆ€ Ο΅ > 0
β€’ βˆ€ Ξ΄ > 0
β€’ A produces h s.t. Pr[err(h) β‰₯ Ο΅] ≀ Ξ΄

Weak Learning:
β€’ βˆƒ algo A
β€’ βˆƒ Ξ³ > 0
β€’ βˆ€ c ∈ H
β€’ βˆ€ D
β€’ βˆ€ Ο΅ > 1/2 βˆ’ Ξ³
β€’ βˆ€ Ξ΄ > 0
β€’ A produces h s.t. Pr[err(h) β‰₯ Ο΅] ≀ Ξ΄

β€’ [Kearns & Valiant ’88]: defined weak learning & posed an open problem of finding a boosting algo.

Page 10:

Surprisingly….

Weak Learning = Strong (PAC) Learning

Original Construction [Schapire ’89]:

β€’ Poly-time boosting algo; exploits that we can learn a little on every distribution.

β€’ A modest booster obtained via calling the weak learning algorithm on 3 distributions:
  error Ξ² < 1/2 βˆ’ Ξ³  β†’  error 3Ξ²Β² βˆ’ 2Ξ²Β³.

β€’ Then amplifies the modest boost of accuracy by running this recursively.

β€’ Cool conceptually and technically, not very practical.
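To see where the 3Ξ²Β² βˆ’ 2Ξ²Β³ figure comes from, here is a sketch of the calculation under the simplifying assumption that the three hypotheses err independently, each with probability Ξ²; a majority vote of three classifiers errs exactly when at least two of them err:

\Pr[\text{majority of } h_1, h_2, h_3 \text{ errs}]
  = \binom{3}{2}\beta^2(1-\beta) + \beta^3
  = 3\beta^2 - 2\beta^3 < \beta
  \qquad \text{for } 0 < \beta < \tfrac{1}{2}.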

Page 11:

An explosion of subsequent work

Page 12:

Adaboost (Adaptive Boosting)

[Freund-Schapire, JCSS’97]

Gödel Prize winner 2003

β€œA Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”

Page 13:

Informal Description of Adaboost

Input: S = {(x_1, y_1), …, (x_m, y_m)}; x_i ∈ X, y_i ∈ Y = {βˆ’1, 1};
weak learning algo A (e.g., NaΓ―ve Bayes, decision stumps).

β€’ For t = 1, 2, …, T
  β€’ Construct D_t on {x_1, …, x_m}.
  β€’ Run A on D_t producing h_t : X β†’ {βˆ’1, 1} (weak classifier);
    Ο΅_t = Pr_{x_i ~ D_t}[h_t(x_i) β‰  y_i] is the error of h_t over D_t.
  β€’ Roughly speaking, D_{t+1} increases the weight on x_i if h_t is incorrect on x_i, and decreases it if h_t is correct.

β€’ Output H_final(x) = sign( Ξ£_{t=1}^T Ξ±_t h_t(x) ).

[Figure: toy 2D dataset of + and βˆ’ examples with a weak classifier h_t.]

β€’ Boosting: turns a weak algo into a strong (PAC) learner.

Page 14:

Adaboost (Adaptive Boosting)

β€’ For t = 1, 2, …, T
  β€’ Construct D_t on {x_1, …, x_m}.
  β€’ Run A on D_t producing h_t.

β€’ Weak learning algorithm A.

Constructing D_t:

β€’ D_1 uniform on {x_1, …, x_m} [i.e., D_1(i) = 1/m].

β€’ Given D_t and h_t, set Ξ±_t = (1/2) ln( (1 βˆ’ Ο΅_t)/Ο΅_t ) > 0 and

  D_{t+1}(i) = (D_t(i)/Z_t) Β· e^{βˆ’Ξ±_t}  if y_i = h_t(x_i)
  D_{t+1}(i) = (D_t(i)/Z_t) Β· e^{Ξ±_t}   if y_i β‰  h_t(x_i)

  i.e., D_{t+1}(i) = (D_t(i)/Z_t) Β· e^{βˆ’Ξ±_t y_i h_t(x_i)}.

β€’ D_{t+1} puts half of its weight on examples x_i where h_t is incorrect and half on examples where h_t is correct.

Final hyp: H_final(x) = sign( Ξ£_t Ξ±_t h_t(x) ).
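To make the loop concrete, below is a minimal sketch in Python (not the lecture's code): it assumes numpy, uses single-feature threshold decision stumps as the weak learner A, and follows the D_t, Ξ±_t, and Z_t updates as written above. The names stump_learner, adaboost, and H_final are illustrative.

import numpy as np

def stump_learner(X, y, D):
    """Weak learner A: the single-feature threshold stump with the lowest
    D-weighted error. Returns a predictor h: X -> {-1, +1}."""
    m, n = X.shape
    best = (0, 0.0, 1, np.inf)                       # (feature, threshold, sign, weighted error)
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, sign, err)
    j, thr, sign, _ = best
    return lambda Xq: sign * np.where(Xq[:, j] <= thr, 1, -1)

def adaboost(X, y, T=10):
    """T rounds of AdaBoost over the sample S = {(x_i, y_i)}, y_i in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                          # D_1 uniform on {x_1, ..., x_m}
    alphas, hyps = [], []
    for t in range(T):
        h = stump_learner(X, y, D)                   # run A on D_t, producing h_t
        pred = h(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1)  # eps_t = Pr_{D_t}[h_t(x_i) != y_i]
        if eps >= 0.5:                               # no weak edge over D_t: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t = 1/2 ln((1 - eps_t)/eps_t) > 0
        D = D * np.exp(-alpha * y * pred)            # up-weight mistakes, down-weight correct
        D = D / D.sum()                              # normalize by Z_t
        alphas.append(alpha)
        hyps.append(h)
    return alphas, hyps

def H_final(Xq, alphas, hyps):
    """Final hypothesis: sign of the alpha-weighted vote of the h_t's."""
    vote = sum(a * h(Xq) for a, h in zip(alphas, hyps))
    return np.sign(vote)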

Page 15:

Adaboost: A toy example

Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)

Page 16:

Adaboost: A toy example

Page 17:

Adaboost: A toy example

Page 18:

Adaboost (Adaptive Boosting): recap of the algorithm and the construction of D_t from Page 14.

Page 19:

Nice Features of Adaboost

β€’ Very general: a meta-procedure, it can use any weak learning algorithm (e.g., NaΓ―ve Bayes, decision stumps)!

β€’ Very fast (single pass through the data each round) & simple to code, no parameters to tune.

β€’ Grounded in rich theory.

β€’ Shift in mindset: the goal is now just to find classifiers a bit better than random guessing.

β€’ Relevant for the big data age: quickly focuses on the β€œcore difficulties”; well-suited to distributed settings where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT’12].

Page 20:

Analyzing Training Error

Theorem. Let Ο΅_t = 1/2 βˆ’ Ξ³_t (the error of h_t over D_t). Then

  err_S(H_final) ≀ exp( βˆ’2 Ξ£_t Ξ³_tΒ² ).

So if βˆ€t, Ξ³_t β‰₯ Ξ³ > 0, then err_S(H_final) ≀ exp( βˆ’2 Ξ³Β² T ).

The training error drops exponentially in T!!!

To get err_S(H_final) ≀ Ο΅, we need only T = O( (1/Ξ³Β²) log(1/Ο΅) ) rounds.

Adaboost is adaptive:
β€’ Does not need to know Ξ³ or T a priori.
β€’ Can exploit Ξ³_t ≫ Ξ³.

Page 21:

Understanding the Updates & Normalization

Claim: D_{t+1} puts half of the weight on the x_i where h_t was incorrect and half of the weight on the x_i where h_t was correct.

Recall D_{t+1}(i) = (D_t(i)/Z_t) Β· e^{βˆ’Ξ±_t y_i h_t(x_i)}.

Normalization:

  Z_t = Ξ£_i D_t(i) e^{βˆ’Ξ±_t y_i h_t(x_i)}
      = Ξ£_{i: y_i = h_t(x_i)} D_t(i) e^{βˆ’Ξ±_t} + Ξ£_{i: y_i β‰  h_t(x_i)} D_t(i) e^{Ξ±_t}
      = (1 βˆ’ Ο΅_t) e^{βˆ’Ξ±_t} + Ο΅_t e^{Ξ±_t}
      = 2 √( Ο΅_t (1 βˆ’ Ο΅_t) ).

Weight on correctly classified examples:

  Pr_{D_{t+1}}[ y_i = h_t(x_i) ] = Ξ£_{i: y_i = h_t(x_i)} (D_t(i)/Z_t) e^{βˆ’Ξ±_t}
      = ((1 βˆ’ Ο΅_t)/Z_t) e^{βˆ’Ξ±_t}
      = ((1 βˆ’ Ο΅_t)/Z_t) √( Ο΅_t/(1 βˆ’ Ο΅_t) )
      = √( Ο΅_t (1 βˆ’ Ο΅_t) ) / Z_t.

Weight on misclassified examples:

  Pr_{D_{t+1}}[ y_i β‰  h_t(x_i) ] = Ξ£_{i: y_i β‰  h_t(x_i)} (D_t(i)/Z_t) e^{Ξ±_t}
      = (Ο΅_t/Z_t) e^{Ξ±_t}
      = (Ο΅_t/Z_t) √( (1 βˆ’ Ο΅_t)/Ο΅_t )
      = √( Ο΅_t (1 βˆ’ Ο΅_t) ) / Z_t.

The probabilities are equal! (And each equals 1/2, since Z_t = 2 √( Ο΅_t (1 βˆ’ Ο΅_t) ).)
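A quick numerical sanity check of this claim, as a sketch assuming numpy; the weights D, the mask wrong, and the values below are made up for illustration:

import numpy as np

# Hypothetical round t: current weights D_t and a mask of which examples h_t got wrong.
D = np.array([0.1, 0.2, 0.3, 0.4])
wrong = np.array([True, False, False, False])

eps = D[wrong].sum()                                 # eps_t: the D_t-weighted error of h_t
alpha = 0.5 * np.log((1 - eps) / eps)                # alpha_t = 1/2 ln((1 - eps_t)/eps_t)

D_next = D * np.exp(np.where(wrong, alpha, -alpha))  # e^{+alpha_t} on mistakes, e^{-alpha_t} otherwise
D_next = D_next / D_next.sum()                       # dividing by the sum is dividing by Z_t

print(D_next[wrong].sum(), D_next[~wrong].sum())     # both are 0.5 (up to float rounding)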

Page 22:

Analyzing Training Error: Proof Intuition

Theorem. Ο΅_t = 1/2 βˆ’ Ξ³_t (error of h_t over D_t); err_S(H_final) ≀ exp( βˆ’2 Ξ£_t Ξ³_tΒ² ).

β€’ On round t, we increase the weight of the x_i for which h_t is wrong.

β€’ If H_final incorrectly classifies x_i,
  - then x_i is incorrectly classified by a (weighted) majority of the h_t's,
  - which implies the final prob. weight of x_i is large.
    Can show it is β‰₯ (1/m) Β· 1/(∏_t Z_t).

β€’ Since the probabilities sum to 1, we can't have too many examples of high weight.
  Can show: # incorrectly classified ≀ m ∏_t Z_t.

β€’ And ∏_t Z_t β†’ 0, which gives the theorem.

Page 23:

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t,
where f(x_i) = Ξ£_t Ξ±_t h_t(x_i) [the unthresholded weighted vote of the h_t's on x_i].

Step 2: err_S(H_final) ≀ ∏_t Z_t.

Step 3: ∏_t Z_t = ∏_t 2√( Ο΅_t (1 βˆ’ Ο΅_t) ) = ∏_t √( 1 βˆ’ 4Ξ³_tΒ² ) ≀ e^{βˆ’2 Ξ£_t Ξ³_tΒ²}.

Page 24:

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t, where f(x_i) = Ξ£_t Ξ±_t h_t(x_i).

Recall D_1(i) = 1/m and D_{t+1}(i) = D_t(i) Β· exp( βˆ’y_i Ξ±_t h_t(x_i) ) / Z_t. So:

  D_{T+1}(i) = [ exp( βˆ’y_i Ξ±_T h_T(x_i) ) / Z_T ] Β· D_T(i)
             = [ exp( βˆ’y_i Ξ±_T h_T(x_i) ) / Z_T ] Β· [ exp( βˆ’y_i Ξ±_{Tβˆ’1} h_{Tβˆ’1}(x_i) ) / Z_{Tβˆ’1} ] Β· D_{Tβˆ’1}(i)
             = …
             = [ exp( βˆ’y_i Ξ±_T h_T(x_i) ) / Z_T ] Β· β‹― Β· [ exp( βˆ’y_i Ξ±_1 h_1(x_i) ) / Z_1 ] Β· (1/m)
             = (1/m) Β· exp( βˆ’y_i (Ξ±_1 h_1(x_i) + β‹― + Ξ±_T h_T(x_i)) ) / (Z_1 β‹― Z_T)
             = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t.

Page 25:

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t, where f(x_i) = Ξ£_t Ξ±_t h_t(x_i).

Step 2: err_S(H_final) ≀ ∏_t Z_t.

  err_S(H_final) = (1/m) Ξ£_i 1[ y_i β‰  H_final(x_i) ]
                 = (1/m) Ξ£_i 1[ y_i f(x_i) ≀ 0 ]
                 ≀ (1/m) Ξ£_i exp( βˆ’y_i f(x_i) )        [0/1 loss ≀ exp loss]
                 = Ξ£_i D_{T+1}(i) Β· ∏_t Z_t            [by Step 1]
                 = ∏_t Z_t.

Page 26:

Analyzing Training Error: Proof Math

Step 1 (unwrapping the recurrence): D_{T+1}(i) = (1/m) Β· exp( βˆ’y_i f(x_i) ) / ∏_t Z_t, where f(x_i) = Ξ£_t Ξ±_t h_t(x_i).

Step 2: err_S(H_final) ≀ ∏_t Z_t.

Step 3: ∏_t Z_t = ∏_t 2√( Ο΅_t (1 βˆ’ Ο΅_t) ) = ∏_t √( 1 βˆ’ 4Ξ³_tΒ² ) ≀ e^{βˆ’2 Ξ£_t Ξ³_tΒ²}.

Note: recall Z_t = (1 βˆ’ Ο΅_t) e^{βˆ’Ξ±_t} + Ο΅_t e^{Ξ±_t} = 2√( Ο΅_t (1 βˆ’ Ο΅_t) );
in fact, Ξ±_t is the minimizer of Ξ± β†’ (1 βˆ’ Ο΅_t) e^{βˆ’Ξ±} + Ο΅_t e^{Ξ±}.
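The final inequality in Step 3 is the standard bound 1 + x ≀ e^x applied with x = βˆ’4Ξ³_tΒ², after rewriting Z_t using Ο΅_t = 1/2 βˆ’ Ξ³_t:

Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}
    = 2\sqrt{\left(\tfrac{1}{2}-\gamma_t\right)\left(\tfrac{1}{2}+\gamma_t\right)}
    = \sqrt{1-4\gamma_t^2}
    \le \sqrt{e^{-4\gamma_t^2}}
    = e^{-2\gamma_t^2},
\qquad \text{so} \qquad \prod_t Z_t \le e^{-2\sum_t \gamma_t^2}.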

Page 27:

β€’ If π»π‘“π‘–π‘›π‘Žπ‘™ incorrectly classifies π‘₯𝑖,

Analyzing Training Error: Proof Intuition

- Then π‘₯𝑖 incorrectly classified by (wtd) majority of β„Žπ‘‘ ’s.

β€’ On round 𝑑, we increase weight of π‘₯𝑖 for which β„Žπ‘‘ is wrong.

- Which implies final prob. weight of π‘₯𝑖 is large.

Can show probability β‰₯1

π‘š

1

ς𝑑 𝑍𝑑

β€’ Since sum of prob. = 1, can’t have too many of high weight.

And (ς𝑑 𝑍𝑑) β†’ 0 .

Can show # incorrectly classified ≀ π‘š ς𝑑 𝑍𝑑 .

Theorem πœ–π‘‘ = 1/2 βˆ’ 𝛾𝑑 (error of β„Žπ‘‘ over 𝐷𝑑)

π‘’π‘Ÿπ‘Ÿπ‘† π»π‘“π‘–π‘›π‘Žπ‘™ ≀ exp βˆ’2

𝑑

𝛾𝑑2

Page 28:

Generalization Guarantees

How about generalization guarantees? Original analysis [Freund & Schapire ’97].

(Recall the training-error theorem: with Ο΅_t = 1/2 βˆ’ Ξ³_t, err_S(H_final) ≀ exp( βˆ’2 Ξ£_t Ξ³_tΒ² ).)

β€’ H: space of weak hypotheses; d = VCdim(H).

β€’ H_final is a weighted vote, so the hypothesis class is:
  G = { all fns of the form sign( Ξ£_{t=1}^T Ξ±_t h_t(x) ) }.

Theorem [Freund & Schapire ’97]. βˆ€ g ∈ G: err(g) ≀ err_S(g) + ΕŒ( √(Td/m) ), where T = # of rounds.

Key reason: VCdim(G) = ΕŒ(dT), plus typical VC bounds.

Page 29:

Generalization Guarantees

Theorem [Freund & Schapire ’97]. βˆ€ g ∈ co(H): err(g) ≀ err_S(g) + ΕŒ( √(Td/m) ), where d = VCdim(H) and T = # of rounds.

  generalization error ≀ train error + complexity term

[Figure: train error and generalization error as a function of complexity (T = # of rounds).]

Page 30:

Generalization Guarantees

β€’ Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large.

β€’ Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

Page 31:

Generalization Guarantees


β€’ These results seem to contradict the FS’97 bound and Occam’s razor (in order to achieve good test error, the classifier should be as simple as possible)!

Page 32:

How can we explain the experiments?

Key Idea:

Training error does not tell the whole story. We also need to consider the classification confidence!!

R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in β€œBoosting the margin: A new explanation for the effectiveness of voting methods”.