Statistical Learning Theory - cs.rutgers.edu/~pa336/mls16/ml-sp16-lec12.pdf
TRANSCRIPT
[Page 1]
Statistical Learning Theory
Instructor: Pranjal Awasthi
[Page 2]
Announcements
• HW 3 out today. Due March 24.
• Final project details out on the webpage
– http://rci.rutgers.edu/~pa336/project.html
– Proposal due – March 31
– Final report due – May 2
– In-class presentations – Apr 28, May 2
[Page 3]
Next few lectures
• Theoretical foundations of ML
– Formally define learning
– What can be learned from data?
– What types of guarantees can we hope to achieve?
• Big Questions
– How do we generate rules that do well on observed data?
– What confidence do we have that they will do well in the future?
[Page 4]
Basic Learning Task
• y = f*(x_1, x_2, …, x_n)
– f* ∈ H
– X = (x_1, x_2, …, x_n) ∈ ℝ^n
– y ∈ {0, 1} or y ∈ ℝ
• Training data S_m = (X_1, y_1), (X_2, y_2), …, (X_m, y_m)
– X_i ∼ D
• Prediction rule: f : ℝ^n → ℝ
• Error: err(f)
– err(f) = Pr_D[f(X) ≠ f*(X)] (classification)
– err(f) = E_D[(f(X) − f*(X))^2] (regression)
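Both error notions can be estimated on a finite sample drawn from D. A minimal sketch, where the target f*, the rule f, and the uniform distribution D are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical target and prediction rule, for illustration only:
# a 1-d threshold target f*(x) = 1[x >= 0.5]; the learner thresholds at 0.6.
f_star = lambda x: int(x >= 0.5)
f = lambda x: int(x >= 0.6)

def classification_error(f, f_star, X):
    """Empirical estimate of err(f) = Pr_D[f(X) != f*(X)]."""
    return float(np.mean([f(x) != f_star(x) for x in X]))

def squared_error(f, f_star, X):
    """Empirical estimate of err(f) = E_D[(f(X) - f*(X))^2]."""
    return float(np.mean([(f(x) - f_star(x)) ** 2 for x in X]))

X = np.random.default_rng(0).uniform(0, 1, 10_000)  # X_i ~ D = Uniform[0, 1]
print(classification_error(f, f_star, X))  # ~0.1: mass of the disagreement region [0.5, 0.6)
```

For 0/1 labels the two definitions coincide, since (f(X) − f*(X))^2 is exactly the indicator of disagreement.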
[Page 5]
Probably Approximately Correct (PAC) model (Valiant '84)
• An algorithm A PAC learns a class H if
– For any given ε > 0, δ > 0, and any learning problem in H,
• A takes as input S_m and produces f of error at most ε, with probability at least 1 − δ
– Learning problem in H:
• Choose a distribution D over ℝ^n and a target f* ∈ H
– m should be polynomial in n, 1/ε, 1/δ
– Ideally, the runtime should also be polynomial in m
– f should be computable in polynomial time
[Page 7]
Probably Approximately Correct (PAC) model (Valiant '84)
• Not very realistic
– Assumes y = f*(x_1, x_2, …, x_n), i.e., noiseless labels from a target in H
– A first step towards more useful extensions
– This lecture will stick with prediction tasks with y ∈ {0, 1}
[Page 8]
Probably Approximately Correct (PAC) model (Valiant '84)
• How to PAC learn H?
– Natural idea: fit the training data
– Empirical Risk Minimization (ERM): find f ∈ H such that err_Sm(f) = 0
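A minimal ERM sketch for one concrete finite class, 1-d thresholds on a grid. The grid, the target threshold, and the data distribution are assumptions for illustration; the slides leave H abstract:

```python
import numpy as np

# Assumed finite class H = { x -> 1[x >= t] : t in a grid of 101 thresholds }.
thresholds = np.linspace(0.0, 1.0, 101)
t_star = thresholds[37]                      # target f* is in H (realizable case)

def erm(S_X, S_y, thresholds):
    """Return a threshold with zero empirical error on the sample, if one exists."""
    for t in thresholds:
        if np.array_equal((S_X >= t).astype(int), S_y):
            return t
    return None                              # sample not realizable by H

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, 200)               # X_i ~ D = Uniform[0, 1]
y = (X >= t_star).astype(int)                # y_i = f*(X_i), noiseless
t_hat = erm(X, y, thresholds)
print(t_hat)                                 # a threshold consistent with all 200 labels
```

Under the uniform distribution, any consistent threshold t̂ has true error |t̂ − t*|, the probability mass between the two thresholds, which shrinks as m grows.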
[Page 9]
ERM as a universal algorithm
• Let A be the empirical risk minimizer
Theorem: For any finite H, ERM PAC learns H provided
m ≥ (1/ε) log(|H|/δ)
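Plugging numbers into this bound gives concrete sample sizes; a small calculator (the example values |H| = 101, ε = 0.1, δ = 0.05 are arbitrary):

```python
import math

def pac_sample_size(H_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * log(|H|/delta), per the theorem."""
    return math.ceil((1.0 / eps) * math.log(H_size / delta))

print(pac_sample_size(101, 0.1, 0.05))  # 77 samples suffice
```

Note that the dependence on the class size is only logarithmic: doubling |H| adds just (1/ε) log 2 extra samples.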
[Page 12]
ERM as a universal algorithm
Theorem: For any finite H, ERM PAC learns H provided
m ≥ (1/ε) log(|H|/δ)
What if H is infinite or uncountable?
[Page 13]
1-d example
[Pages 14-16: 1-d example, continued]
Should be able to replace |H| with the number of "different" functions on S_m.
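This finiteness is easy to check by enumeration. For 1-d thresholds x → 1[x ≥ t] (an assumed example class; the slides' figures are not preserved), an uncountable family of functions induces only m + 1 distinct labelings on m points:

```python
import numpy as np

def induced_labelings(points):
    """All distinct 0/1 labelings of `points` induced by thresholds x -> 1[x >= t]."""
    pts = sorted(points)
    # Sweeping t left to right only changes the labeling when t crosses a
    # sample point, so trying t at each point (plus one past the maximum)
    # covers every achievable labeling.
    cuts = pts + [pts[-1] + 1.0]
    return {tuple(int(x >= t) for x in points) for t in cuts}

S = list(np.random.default_rng(2).uniform(0, 1, 8))   # m = 8 sample points
print(len(induced_labelings(S)))                      # 9 = m + 1, versus 2^8 = 256
```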
[Page 17]
ERM as a universal algorithm
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
C[2m] = maximum # of ways to label any set of 2m points using functions in H
= maximum # of functions induced on a set of 2m points by H
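This bound is implicit in m, since C[2m] itself depends on m. For the 1-d threshold class (an assumption; there C[2m] = 2m + 1), the smallest sufficient m can be found by a scan:

```python
import math

def sufficient_m(eps, delta):
    """Smallest m with m >= (4/eps) * log(2 * C[2m] / delta), using C[2m] = 2m + 1."""
    m = 1
    while m < (4.0 / eps) * math.log(2 * (2 * m + 1) / delta):
        m += 1
    return m

print(sufficient_m(0.1, 0.05))  # a few hundred samples for eps = 0.1, delta = 0.05
```

Because C[2m] grows only linearly here, the right-hand side grows like log m and the scan terminates quickly; contrast this with |H|, which is infinite for thresholds over the reals.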
[Page 19]
1-d example
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
[Page 20]
2-d example
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
[Page 21]
ERM as a universal algorithm
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
C[2m] = maximum # of ways to label any set of 2m points using functions in H
How to bound 𝐶[2𝑚] for a general class?