Statistical Learning Theory - cs.rutgers.edu/~pa336/mls16/ml-sp16-lec12.pdf
TRANSCRIPT
[Page 1]
Statistical Learning Theory
Instructor: Pranjal Awasthi
[Page 2]
Announcements
• HW 3 out today. Due March 24.
• Final project details out on the webpage
– http://rci.rutgers.edu/~pa336/project.html
– Proposal due – March 31
– Final report due – May 2
– In-class presentations – Apr 28, May 2
[Page 3]
Next few lectures
• Theoretical foundations of ML
– Formally define learning
– What can be learned from data?
– What types of guarantees can we hope to achieve?
• Big Questions
– How do we generate rules that do well on observed data?
– What confidence do we have that they will do well in the future?
[Page 4]
Basic Learning Task
• y = f*(x_1, x_2, …, x_n)
– f* ∈ H
– X = (x_1, x_2, …, x_n) ∈ ℝ^n
– y ∈ {0, 1} or y ∈ ℝ
• Training data S_m = (X_1, y_1), (X_2, y_2), …, (X_m, y_m)
– X_i ∼ D
• Prediction rule: f : ℝ^n → ℝ
• Error: err(f)
– err(f) = Pr_D[f(X) ≠ f*(X)] (classification)
– err(f) = E_D[(f(X) − f*(X))^2] (regression)
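Both error notions can be estimated on a finite sample drawn from D. A minimal sketch, where the target f*, the rule f, and the uniform distribution D are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical target and prediction rule, for illustration only:
# a 1-d threshold target f*(x) = 1[x >= 0.5]; the learner thresholds at 0.6.
f_star = lambda x: int(x >= 0.5)
f = lambda x: int(x >= 0.6)

def classification_error(f, f_star, X):
    """Empirical estimate of err(f) = Pr_D[f(X) != f*(X)]."""
    return float(np.mean([f(x) != f_star(x) for x in X]))

def squared_error(f, f_star, X):
    """Empirical estimate of err(f) = E_D[(f(X) - f*(X))^2]."""
    return float(np.mean([(f(x) - f_star(x)) ** 2 for x in X]))

X = np.random.default_rng(0).uniform(0, 1, 10_000)  # X_i ~ D = Uniform[0, 1]
print(classification_error(f, f_star, X))  # ~0.1: mass of the disagreement region [0.5, 0.6)
```

For 0/1 labels the two definitions coincide, since (f(X) − f*(X))^2 is exactly the indicator of disagreement.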
[Page 5]
Probably Approximately Correct (PAC) model (Valiant '84)
• An algorithm A PAC learns a class H if
– For any given ε > 0, δ > 0, and any learning problem in H,
• A takes as input S_m and produces f of error at most ε, with probability at least 1 − δ
– Learning problem in H:
• Choose a distribution D over ℝ^n and a target f* ∈ H
– m should be polynomial in n, 1/ε, 1/δ
– Ideally, the runtime should also be polynomial in m
– f should be computable in polynomial time
[Page 7]
Probably Approximately Correct (PAC) model (Valiant '84)
• Not very realistic
– Assumes y = f*(x_1, x_2, …, x_n), i.e., noiseless labels from a target in H
– A first step towards more useful extensions
– This lecture will stick with prediction tasks with y ∈ {0, 1}
[Page 8]
Probably Approximately Correct (PAC) model (Valiant '84)
• How to PAC learn H?
– Natural idea: fit the training data
– Empirical Risk Minimization (ERM): find f ∈ H such that err_Sm(f) = 0
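A minimal ERM sketch for one concrete finite class, 1-d thresholds on a grid. The grid, the target threshold, and the data distribution are assumptions for illustration; the slides leave H abstract:

```python
import numpy as np

# Assumed finite class H = { x -> 1[x >= t] : t in a grid of 101 thresholds }.
thresholds = np.linspace(0.0, 1.0, 101)
t_star = thresholds[37]                      # target f* is in H (realizable case)

def erm(S_X, S_y, thresholds):
    """Return a threshold with zero empirical error on the sample, if one exists."""
    for t in thresholds:
        if np.array_equal((S_X >= t).astype(int), S_y):
            return t
    return None                              # sample not realizable by H

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, 200)               # X_i ~ D = Uniform[0, 1]
y = (X >= t_star).astype(int)                # y_i = f*(X_i), noiseless
t_hat = erm(X, y, thresholds)
print(t_hat)                                 # a threshold consistent with all 200 labels
```

Under the uniform distribution, any consistent threshold t̂ has true error |t̂ − t*|, the probability mass between the two thresholds, which shrinks as m grows.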
[Page 9]
ERM as a universal algorithm
• Let A be the empirical risk minimizer
Theorem: For any finite H, ERM PAC learns H provided
m ≥ (1/ε) log(|H|/δ)
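Plugging numbers into this bound gives concrete sample sizes; a small calculator (the example values |H| = 101, ε = 0.1, δ = 0.05 are arbitrary):

```python
import math

def pac_sample_size(H_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * log(|H|/delta), per the theorem."""
    return math.ceil((1.0 / eps) * math.log(H_size / delta))

print(pac_sample_size(101, 0.1, 0.05))  # 77 samples suffice
```

Note that the dependence on the class size is only logarithmic: doubling |H| adds just (1/ε) log 2 extra samples.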
[Page 12]
ERM as a universal algorithm
Theorem: For any finite H, ERM PAC learns H provided
m ≥ (1/ε) log(|H|/δ)
What if H is infinite or uncountable?
[Page 13]
1-d example
[Pages 14-16: 1-d example, continued]
Should be able to replace |H| with the number of "different" functions on S_m.
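This finiteness is easy to check by enumeration. For 1-d thresholds x → 1[x ≥ t] (an assumed example class; the slides' figures are not preserved), an uncountable family of functions induces only m + 1 distinct labelings on m points:

```python
import numpy as np

def induced_labelings(points):
    """All distinct 0/1 labelings of `points` induced by thresholds x -> 1[x >= t]."""
    pts = sorted(points)
    # Sweeping t left to right only changes the labeling when t crosses a
    # sample point, so trying t at each point (plus one past the maximum)
    # covers every achievable labeling.
    cuts = pts + [pts[-1] + 1.0]
    return {tuple(int(x >= t) for x in points) for t in cuts}

S = list(np.random.default_rng(2).uniform(0, 1, 8))   # m = 8 sample points
print(len(induced_labelings(S)))                      # 9 = m + 1, versus 2^8 = 256
```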
[Page 17]
ERM as a universal algorithm
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
C[2m] = maximum # of ways to label any set of 2m points using functions in H
= maximum # of functions induced on a set of 2m points by H
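This bound is implicit in m, since C[2m] itself depends on m. For the 1-d threshold class (an assumption; there C[2m] = 2m + 1), the smallest sufficient m can be found by a scan:

```python
import math

def sufficient_m(eps, delta):
    """Smallest m with m >= (4/eps) * log(2 * C[2m] / delta), using C[2m] = 2m + 1."""
    m = 1
    while m < (4.0 / eps) * math.log(2 * (2 * m + 1) / delta):
        m += 1
    return m

print(sufficient_m(0.1, 0.05))  # a few hundred samples for eps = 0.1, delta = 0.05
```

Because C[2m] grows only linearly here, the right-hand side grows like log m and the scan terminates quickly; contrast this with |H|, which is infinite for thresholds over the reals.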
[Page 19]
1-d example
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
[Page 20]
2-d example
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
[Page 21]
ERM as a universal algorithm
Theorem: For any H, ERM PAC learns H provided
m ≥ (4/ε) log(2C[2m]/δ)
C[2m] = maximum # of ways to label any set of 2m points using functions in H
How to bound 𝐶[2𝑚] for a general class?