Machine Learning and Data Mining I · 2018. 10. 10.
TRANSCRIPT
DS 4400
Alina Oprea
Associate Professor, CCIS
Northeastern University
October 9, 2018
Machine Learning and Data Mining I
Logistics
• HW 2 is on Piazza
– Due date: Friday, Oct 19
– Save Jupyter Notebook as PDF or generate PDF with results
– We run your code selectively, but if the code doesn’t run we would like to see results
– Make sure that permissions are set so we can access your files if using a cloud service
– Hosting it on GitHub privately is the best option
• Grading for HW 1 is almost done
– Grades will be posted on Gradescope
• Midterm next Tuesday, Oct. 16
– Material up to Decision Trees and Bagging (including everything in today’s lecture)
Review
• Metrics for evaluating classifiers
– Accuracy, error, precision, recall, F1 score
– AUC (area under the ROC curve) measures classifier performance across different thresholds
• Feature selection methods
– Filters decide on each feature individually
– Wrappers select a subset of features via a search strategy (fixing the model and evaluating with cross-validation)
– Embedded methods (e.g., regularization) are part of training
• Decision trees are interpretable, non-linear models
– Greedy algorithm to train decision trees
– Works on categorical and numerical data
– Node splitting is done by highest Information Gain
Outline
• Decision trees
– ID3 algorithm (use Information Gain for splitting)
– Solutions against overfitting (e.g., pruning)
• Lab
• Ensemble learning
– Reduce variance
– Decrease classification error
• Bagging: a method for building ensemble classifiers
Sample Dataset
Categorical data
Decision Tree
Learning Decision Trees
Full Tree
Splitting
Split on the attribute that most reduces uncertainty
Information Gain
Example
Example Information Gain
Pure node: maximum Information Gain
Learning Decision Trees
The ID3 algorithm uses Information Gain. Information Gain reduces uncertainty about Y.
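As an illustration of the split criterion (a sketch, not code from the lecture): entropy of the label distribution, and the Information Gain of splitting on one categorical attribute.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(Y; X) = H(Y) - sum over values v of P(X = v) * H(Y | X = v)."""
    n = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy labels: a perfectly predictive attribute has maximal gain.
y = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], y))  # 1.0: split yields pure children
print(information_gain(["a", "b", "a", "b"], y))  # 0.0: split is uninformative
```

ID3 evaluates this gain for every remaining attribute and splits on the one with the highest value.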
When to stop?
Case 1
Case 2
Overfitting
Solutions against Overfitting
• Pruning
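The lab later uses R's cv.tree with FUN = prune.misclass (cost-complexity pruning). As a language-neutral illustration of the general idea, here is a sketch of a simpler variant, reduced-error pruning: collapse a subtree to its majority-label leaf whenever that does not hurt accuracy on a held-out validation set. The tree, attributes, and data below are hypothetical.

```python
# Internal nodes store the majority label of their training subset so a
# subtree can be collapsed to a single leaf.
def predict(node, x):
    while "label" not in node:
        child = node["children"].get(x.get(node["attr"]))
        if child is None:            # unseen attribute value: fall back to majority
            return node["majority"]
        node = child
    return node["label"]

def accuracy(node, data):
    return sum(predict(node, x) == y for x, y in data) / len(data)

def prune(node, root, val_data):
    """Bottom-up: tentatively collapse each internal node to its majority
    leaf; keep the collapse if validation accuracy does not decrease."""
    if "label" in node:
        return
    for child in node["children"].values():
        prune(child, root, val_data)
    before = accuracy(root, val_data)
    saved = dict(node)                  # remember the internal node
    node.clear()
    node["label"] = saved["majority"]   # collapse to a leaf
    if accuracy(root, val_data) < before:
        node.clear()
        node.update(saved)              # pruning hurt: undo it

# Hypothetical overfit tree: the "wind" split below "sunny" fits noise.
tree = {"attr": "outlook", "majority": "yes", "children": {
    "sunny": {"attr": "wind", "majority": "yes", "children": {
        "weak": {"label": "yes"}, "strong": {"label": "no"}}},
    "rain": {"label": "no"},
}}
val = [({"outlook": "sunny", "wind": "strong"}, "yes"),
       ({"outlook": "sunny", "wind": "weak"}, "yes"),
       ({"outlook": "rain"}, "no")]

prune(tree, tree, val)
print(tree["children"]["sunny"])  # collapsed to {'label': 'yes'}
```

After pruning, the noisy "wind" subtree is gone while the useful "outlook" split survives, since collapsing the root would lower validation accuracy.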
Pruning Decision Trees
Real-Valued Inputs
Naïve Approach
Threshold Splits
Information Gain metric can be extended to numerical attributes
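One common way to do this (a sketch; the slides may use a different candidate-threshold rule): sort the distinct values, take the midpoint between each consecutive pair as a candidate threshold, and keep the one whose binary split x ≤ t vs. x > t has the highest Information Gain.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Scan midpoints between consecutive sorted distinct values and return
    the (threshold, gain) pair maximizing the Information Gain of the
    binary split x <= t vs. x > t."""
    n = len(labels)
    base = entropy(labels)
    points = sorted(set(values))
    best = (None, -1.0)
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data: the labels flip between 20 and 40, so t = 30 separates perfectly.
t, gain = best_threshold([10, 20, 40, 50], ["no", "no", "yes", "yes"])
print(t, gain)  # 30.0 1.0
```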
Interpretability
Decision Boundary
Decision Trees vs Linear Models
Linear model vs. Decision tree
Lab
Add label “High” = Yes if Sales > 8
Train and test accuracy
Pruning
• Cross-validation for pruning
• FUN = prune.misclass indicates that classification error is the metric to minimize
Outline
• Decision trees
– ID3 algorithm (use Information Gain for splitting)
– Overfitting solutions (e.g., pruning)
• Lab
• Ensemble learning
– Reduce variance
– Decrease classification error
• Bagging: a method for building ensemble classifiers
Decision Trees
How to reduce the variance of a single decision tree?
Ensemble Learning
Build Ensemble Classifiers
• Basic idea
– Build different “experts”, and let them vote
• Advantages
– Improved predictive performance
– Easy to implement
– Not much parameter tuning required
• Disadvantages
– The combined classifier is not transparent or interpretable
– Not a compact representation
Why do they work?
• Suppose there are 25 base classifiers
• Each classifier has error rate $\varepsilon = 0.35$
• Assume independence among classifiers
• Probability that the ensemble classifier makes a wrong prediction (a majority, i.e. at least 13 of the 25 classifiers, must err):

$P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$
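The sum can be checked numerically; a quick sketch using the slide's numbers (25 independent classifiers, each with error rate 0.35):

```python
from math import comb

eps = 0.35  # error rate of each base classifier
n = 25      # number of independent base classifiers

# The majority vote is wrong only if at least 13 of the 25 classifiers err:
p_ensemble = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                 for i in range(13, n + 1))
print(round(p_ensemble, 2))  # 0.06
```

The ensemble's error (about 0.06) is far below the 0.35 error of any single classifier, which is why voting helps when the errors are (roughly) independent.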
Practical Applications
Netflix Prize
General Idea
Majority Votes
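Bagging (next on the outline) instantiates this picture: each “expert” is trained on a bootstrap sample of the training data, and predictions are combined by majority vote. A minimal sketch with hypothetical 1-D decision stumps as base learners (toy data, not from the lecture):

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a 1-D decision stump (x <= t -> left label, else right label)
    by exhaustively minimizing training error."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for left, right in [("no", "yes"), ("yes", "no")]:
            errs = sum((left if x <= t else right) != y for x, y in sample)
            if best is None or errs < best[0]:
                best = (errs, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging(data, n_models=11, seed=0):
    """Train each 'expert' on a bootstrap sample (drawn with replacement,
    same size as the data), then combine predictions by majority vote."""
    rng = random.Random(seed)
    models = [train_stump([rng.choice(data) for _ in data])
              for _ in range(n_models)]
    def vote(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return vote

# Toy 1-D data: the label flips between x = 3 and x = 4.
data = [(1, "no"), (2, "no"), (3, "no"), (4, "yes"), (5, "yes"), (6, "yes")]
vote = bagging(data)
print(vote(1), vote(6))  # the ensemble's votes for the two extreme points
```

Each stump sees a slightly different bootstrap sample and so picks a slightly different threshold; the vote averages away that variance, which is exactly the motivation for bagging high-variance models such as deep decision trees.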
Acknowledgements
• Slides made using resources from:
– Andrew Ng
– Eric Eaton
– David Sontag
– Andrew Moore
• Thanks!