
Page 1:

DS 4400

Alina Oprea

Associate Professor, CCIS

Northeastern University

October 9, 2018

Machine Learning and Data Mining I

Page 2:

Logistics

• HW 2 is on Piazza
– Due date: Friday, Oct 19
– Save your Jupyter Notebook as a PDF or generate a PDF with results
– We run your code selectively, but if the code doesn't run we would like to see results
– Make sure permissions are set so we can access your files if you use a cloud service
– Hosting it privately on GitHub is the best option

• Grading for HW 1 is almost done
– Grades will be posted on Gradescope

• Midterm next Tuesday, Oct. 16
– Material up to and including decision trees and bagging (everything covered through today's lecture)


Page 3:

Review

• Metrics for evaluating classifiers (a short sketch follows this list)
– Accuracy, error, precision, recall, F1 score
– AUC (area under the ROC curve) measures classifier performance across different thresholds

• Feature selection methods
– Filters decide on each feature individually
– Wrappers select a subset of features via a search strategy (fixing the model and evaluating with cross-validation)
– Embedded methods (e.g., regularization) are part of training

• Decision trees are interpretable, non-linear models
– Trained with a greedy algorithm
– Work on both categorical and numerical data
– Nodes are split by highest Information Gain
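A minimal sketch (not from the slides) of the metrics above, computed in R from a hypothetical 2x2 confusion matrix; the counts tp, fp, fn, tn are made up for illustration.

# Hypothetical confusion-matrix counts
tp <- 40; fp <- 10; fn <- 5; tn <- 45
accuracy  <- (tp + tn) / (tp + fp + fn + tn)   # fraction of correct predictions
error     <- 1 - accuracy
precision <- tp / (tp + fp)                    # of predicted positives, how many are correct
recall    <- tp / (tp + fn)                    # of actual positives, how many are found
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, error = error, precision = precision, recall = recall, F1 = f1)

AUC is different: it is obtained by sweeping the decision threshold and computing the area under the resulting ROC curve, not from a single confusion matrix.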


Page 4:

Outline

• Decision trees

– ID3 algorithm (use Information Gain for splitting)

– Solutions against overfitting (e.g., pruning)

• Lab

• Ensemble learning

– Reduce variance

– Decrease classification error

• Bagging: a method for designing ensembles


Page 5:

Sample Dataset


Categorical data

Page 6:

Decision Tree


Page 7:

Learning Decision Trees


Page 8:

Full Tree


Page 9:

Splitting


Split on the attribute that reduces uncertainty the most

Page 10:

Information Gain


Page 11:

Example


Page 12:

Example Information Gain


Pure node

Max Information Gain

Page 13:

Learning Decision Trees


The ID3 algorithm uses Information Gain for splitting. Information Gain reduces uncertainty about Y.
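A minimal sketch (not from the slides) of computing entropy and Information Gain for a categorical attribute in R; the play/outlook vectors are hypothetical.

# Entropy of a vector of class labels: H(Y) = -sum_c p_c * log2(p_c)
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Information Gain of splitting labels y on a categorical attribute x:
# IG(Y, X) = H(Y) - sum_v P(X = v) * H(Y | X = v)
info_gain <- function(y, x) {
  weighted <- sapply(split(y, x), function(y_v) length(y_v) / length(y) * entropy(y_v))
  entropy(y) - sum(weighted)
}

# Hypothetical example: how much does "outlook" reduce uncertainty about "play"?
play    <- c("No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes")
outlook <- c("Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny", "Sunny", "Rain")
info_gain(play, outlook)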

Page 14:

When to stop?


Page 15:

Case 1


Page 16:

Case 2


Page 17:

Overfitting


Page 18:

Solutions against Overfitting


• Pruning

Page 19:

Pruning Decision Trees


Page 20:

Pruning Decision Trees


Page 21:

Real-Valued Inputs


Page 22:

Naïve Approach


Page 23:

Threshold Splits


The Information Gain metric can be extended to numerical attributes by splitting on a threshold (x ≤ t vs. x > t)
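A minimal sketch (not from the slides) of one common way to pick the threshold: try the midpoints between consecutive sorted values of the attribute and keep the one with the highest Information Gain. It reuses the entropy() helper from the earlier sketch; the humidity/play data are hypothetical.

# Best threshold split for numerical attribute x and labels y
best_threshold <- function(y, x) {
  vals <- sort(unique(x))
  thresholds <- (head(vals, -1) + tail(vals, -1)) / 2   # midpoints between consecutive values
  gains <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x >  t]
    entropy(y) - (length(left)  / length(y)) * entropy(left) -
                 (length(right) / length(y)) * entropy(right)
  })
  list(threshold = thresholds[which.max(gains)], gain = max(gains))
}

# Hypothetical numerical attribute and labels
humidity <- c(85, 90, 78, 96, 80, 70, 65, 95, 70, 80)
play     <- c("No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes")
best_threshold(play, humidity)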

Page 24:

Interpretability


Page 25:

Decision Boundary


Page 26:

Decision Trees vs Linear Models

Linear model vs. decision tree

Page 27:

Lab


Page 28:

Lab


Add a label "High": Yes if Sales > 8, No otherwise
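A minimal sketch of what this lab step likely looks like in R, assuming the ISLR Carseats data and the tree package (the later FUN = prune.misclass slide suggests this setup); the lab's actual code may differ.

library(ISLR)   # Carseats dataset
library(tree)   # classification and regression trees

# Create the binary label: High = "Yes" if Sales > 8, "No" otherwise
Carseats$High <- as.factor(ifelse(Carseats$Sales > 8, "Yes", "No"))

# Fit a classification tree on all predictors except Sales itself
tree.carseats <- tree(High ~ . - Sales, data = Carseats)
summary(tree.carseats)
plot(tree.carseats)
text(tree.carseats, pretty = 0)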

Page 29:

Lab


Train and Test

Accuracy
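A minimal sketch of the train/test step, continuing the hypothetical Carseats setup above (the split size and seed are illustrative; the lab's code may differ).

# Hold out a test set and refit the tree on the training rows only
set.seed(2)
train <- sample(1:nrow(Carseats), 200)
Carseats.test <- Carseats[-train, ]

tree.carseats <- tree(High ~ . - Sales, data = Carseats, subset = train)

# Predict on the held-out rows and compute test accuracy from the confusion matrix
tree.pred <- predict(tree.carseats, Carseats.test, type = "class")
conf <- table(tree.pred, Carseats.test$High)
sum(diag(conf)) / sum(conf)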

Page 30:

Lab


Page 31:


Page 32:

Pruning


• Cross-validation for pruning (see the sketch below)
• FUN = prune.misclass indicates that classification error is the metric to minimize
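A minimal sketch of cross-validated pruning with the tree package, continuing the hypothetical Carseats setup (the chosen subtree size and object names are illustrative; the lab's code may differ).

# Cross-validate over subtree sizes, scoring by misclassification error
set.seed(3)
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
plot(cv.carseats$size, cv.carseats$dev, type = "b")   # CV error vs. number of leaves

# Prune to the size with the lowest CV error (9 is just a placeholder here)
prune.carseats <- prune.misclass(tree.carseats, best = 9)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

# Re-evaluate the pruned tree on the held-out test set
prune.pred <- predict(prune.carseats, Carseats.test, type = "class")
conf <- table(prune.pred, Carseats.test$High)
sum(diag(conf)) / sum(conf)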

Page 33:

Pruning


Page 34:

Outline

• Decision trees

– ID3 algorithm (use Information Gain for splitting)

– Overfitting solutions (e.g., pruning)

• Lab

• Ensemble learning

– Reduce variance

– Decrease classification error

• Bagging: a method for designing ensembles


Page 35:


Decision Trees

How can we reduce the variance of a single decision tree?

Page 36:

Ensemble Learning


Page 37:

Build Ensemble Classifiers

• Basic idea

– Build different “experts”, and let them vote

• Advantages

– Improve predictive performance

– Easy to implement

– Not too much parameter tuning

• Disadvantages

– The combined classifier is not transparent or interpretable

– Not a compact representation


Page 38:

Why do they work?

• Suppose there are 25 base classifiers

• Each classifier has error rate ε = 0.35

• Assume independence among classifiers

• Probability that the ensemble classifier makes a wrong prediction:

$$\sum_{i=13}^{25} \binom{25}{i} \, \varepsilon^{i} (1 - \varepsilon)^{25 - i} \approx 0.06$$

(the majority vote is wrong only when at least 13 of the 25 classifiers are wrong, each with probability ε = 0.35)
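A quick way to check this binomial sum in base R:

# P(13 or more of 25 independent classifiers, each wrong with probability 0.35, are wrong)
sum(dbinom(13:25, size = 25, prob = 0.35))   # approximately 0.06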


Page 39:

Practical Applications


Page 40:

Netflix Prize


Page 41:

General Idea

Majority Votes
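A minimal sketch (not from the slides) of this general idea in R: train several trees on bootstrap samples and combine their predictions by majority vote. It reuses the hypothetical Carseats objects (train, Carseats.test, High) from the lab sketches above.

# Train B trees, each on a bootstrap sample of the training rows
set.seed(1)
B <- 25
votes <- sapply(1:B, function(b) {
  boot <- sample(train, replace = TRUE)                        # bootstrap sample of row indices
  fit  <- tree(High ~ . - Sales, data = Carseats, subset = boot)
  as.character(predict(fit, Carseats.test, type = "class"))    # one vote per test example
})

# Majority vote across the B classifiers for each test example
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(majority == Carseats.test$High)                           # ensemble test accuracy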

Page 42:

Acknowledgements

• Slides made using resources from:

– Andrew Ng

– Eric Eaton

– David Sontag

– Andrew Moore

• Thanks!
