
CS 2750 Machine Learning
Lecture 13

Milos Hauskrecht
[email protected]
5329 Sennott Square

Multiclass classification
Decision trees


Midterm exam

Midterm Tuesday, March 4, 2014

• In-class (75 minutes)

• closed book

• material covered by February 27, 2014


Project proposals

Due: Thursday, March 20, 2014

• 1 page long

• Written proposal:

1. Outline of the learning problem and the type of data you have available. Why is the problem important?

2. Learning methods you plan to try and implement for the problem. References to previous work.

3. How you plan to test and compare the learning approaches.

4. Schedule of work (approximate timeline).


Project proposals

Where to find the data:

• From your research

• UC Irvine data repository

• Various text document repositories

• I have some bioinformatics data I can share; other data can be found on the NIH or various university web sites (e.g. microarray data, proteomic data)

• Synthetic data generated to demonstrate that your algorithm works


Project proposals

Problems to address:

• Get ideas for the project by browsing the web

• It is tempting to go with simple classification, but try to add some complexity to your investigation

• Try multiple methods, not just one; consider more advanced methods, e.g. ones that combine multiple classifiers to learn a model (ensemble methods), or try to modify existing methods


Project proposals

Interesting problems to consider:

• Advanced methods for learning multi-class problems

• Learning the parameters and structure of Bayesian Belief networks

• Clustering of data – how to group examples

• Dimensionality reduction/feature selection – how to deal with a large number of inputs

• Learning how to act – Reinforcement learning

• Anomaly detection – how to identify outliers in data


Multiclass classification

• Binary classification: $Y = \{0, 1\}$

• Multiclass classification

– $K$ classes: $Y = \{0, 1, \dots, K-1\}$

– Goal: learn to classify $K$ classes correctly, i.e. learn $f: X \to \{0, 1, \dots, K-1\}$

• Errors:

– Zero-one (misclassification) error for an example:

$\mathrm{Error}_1(\mathbf{x}_i, y_i) = 1$ if $f(\mathbf{x}_i, \mathbf{w}) \neq y_i$, and $0$ if $f(\mathbf{x}_i, \mathbf{w}) = y_i$

– Mean misclassification error (for a dataset):

$\frac{1}{n} \sum_{i=1}^{n} \mathrm{Error}_1(\mathbf{x}_i, y_i)$
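As a quick illustration, a minimal Python sketch of the mean misclassification error defined above; the toy data and the constant predictor are made up for the example:

```python
import numpy as np

def mean_misclassification_error(f, X, y):
    """Mean zero-one error: the fraction of examples with f(x_i) != y_i."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)

# Toy usage with a hypothetical constant predictor f(x) = 1.
X_toy = np.array([[0.5], [1.5], [2.5]])
y_toy = np.array([0, 1, 2])
print(mean_misclassification_error(lambda x: 1, X_toy, y_toy))  # -> 2/3
```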


Multiclass classification

Approaches:

• Generative model approach

– Generative model of the distribution p(x,y)

– Learns the parameters of the model through density estimation techniques

– Discriminant functions are based on the model

• “Indirect” learning of a classifier

• Discriminative approach

– Parametric discriminant functions

– Learns discriminant functions directly

• A logistic regression model.


Generative model approach

Indirect:

1. Represent and learn the distribution $p(\mathbf{x}, y)$

2. Define and use probabilistic discriminant functions

$g_i(\mathbf{x}) = \log p(y = i \mid \mathbf{x})$

Model: $p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y)$

• $p(\mathbf{x} \mid y)$ = class-conditional distributions (densities):

$k$ class-conditional distributions $p(\mathbf{x} \mid y = i)$, $i = 0, 1, \dots, k-1$

• $p(y)$ = priors on classes, the probability of class $y$, satisfying $\sum_{i=0}^{k-1} p(y = i) = 1$

CS 2750 Machine Learning

Multi-way classification. Example

[Figure: 2D data points from three classes, labeled 0, 1, and 2.]


Making class decision

Discriminant functions can be based on:

• Likelihood of data: choose the class (e.g. the Gaussian) that explains the input data $\mathbf{x}$ better

Choice: $i = \arg\max_{i=0,\dots,k-1} p(\mathbf{x} \mid \boldsymbol{\theta}_i)$

For Gaussians: $p(\mathbf{x} \mid \boldsymbol{\theta}_i) = p(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$

• Posterior of a class: choose the class with the higher posterior probability

$p(y = i \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid y = i)\, p(y = i)}{\sum_{j=0}^{k-1} p(\mathbf{x} \mid y = j)\, p(y = j)}$

Choice: $i = \arg\max_{i=0,\dots,k-1} p(y = i \mid \mathbf{x}, \boldsymbol{\theta}_i)$
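A minimal sketch of the posterior-based decision rule with Gaussian class-conditional densities, using scipy's multivariate normal; the means, covariances, and priors below are toy assumptions, not values from the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy class-conditional Gaussians p(x | y=i) and priors p(y=i) -- assumed values.
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 2.0])]
covs = [np.eye(2)] * 3
priors = np.array([0.5, 0.3, 0.2])

def posterior(x):
    """p(y=i | x) via Bayes rule over the k Gaussian class models."""
    likelihoods = np.array([multivariate_normal.pdf(x, m, c)
                            for m, c in zip(means, covs)])
    joint = likelihoods * priors        # p(x | y=i) p(y=i)
    return joint / joint.sum()          # normalize over classes

x = np.array([1.0, 1.0])
print(np.argmax(posterior(x)))          # class with the highest posterior
```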


Discriminative approach

• Parametric model of discriminant functions

• Learns the discriminant functions directly

How to learn to classify multiple classes, say 0,1,2?

Approach 1:

• A binary logistic regression on every class versus the rest (OvR)

0 vs. (1 or 2)

1 vs. (0 or 2)

2 vs. (0 or 1)

[Figure: a logistic regression unit with inputs $1, x_1, \dots, x_d$.]
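A minimal one-vs-rest sketch in Python; using scikit-learn's LogisticRegression as the binary learner is an assumption made for brevity (the slide only posits binary logistic regressions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_rest(X, y, k):
    """Train one binary logistic regression per class: class i vs. the rest."""
    return [LogisticRegression().fit(X, (y == i).astype(int)) for i in range(k)]

def predict_one_vs_rest(models, X):
    """Pick the class whose 'i vs rest' model assigns the highest probability."""
    scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return scores.argmax(axis=1)
```

Taking the arg max of the per-class probabilities, rather than combining hard 0/1 decisions, is one common way to resolve the ambiguous regions illustrated on the following slides.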


Multiclass classification. Example

[Figure: the same three-class dataset, classes 0, 1, and 2.]


Multiclass classification. Approach 1.

[Figure: the three one-vs-rest decision boundaries, 0 vs {1,2}, 1 vs {0,2}, and 2 vs {0,1}, drawn over the data.]


Multiclass classification. Approach 1.

[Figure: the same three one-vs-rest boundaries; a central "region of nobody" is claimed by no classifier, and an ambiguous region is claimed by more than one.]


Multiclass classification. Approach 1.

[Figure: the decision regions that result from combining the three one-vs-rest classifiers.]


Discriminative approach.

How to learn to classify multiple classes, say 0, 1, 2?

Approach 2:

– A binary logistic regression on all pairs

0 vs. 1

0 vs. 2

1 vs. 2

[Figure: a logistic regression unit with inputs $1, x_1, \dots, x_d$.]
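A matching one-vs-one sketch, again assuming scikit-learn's LogisticRegression as the pairwise learner, with majority voting across the pairwise classifiers:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def fit_one_vs_one(X, y, k):
    """Train a binary logistic regression for every pair of classes (i, j)."""
    models = {}
    for i, j in combinations(range(k), 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = LogisticRegression().fit(X[mask],
                                                  (y[mask] == j).astype(int))
    return models

def predict_one_vs_one(models, X, k):
    """Each pairwise model votes for i or j; predict the class with most votes."""
    votes = np.zeros((X.shape[0], k))
    for (i, j), m in models.items():
        pred = m.predict(X)
        votes[pred == 0, i] += 1
        votes[pred == 1, j] += 1
    return votes.argmax(axis=1)
```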


Multiclass classification. Example

[Figure: the three-class dataset again.]


Multiclass classification. Approach 2

[Figure: the three pairwise decision boundaries, 0 vs 1, 0 vs 2, and 1 vs 2, drawn over the data.]


Multiclass classification. Approach 2

[Figure: the pairwise boundaries leave a central ambiguous region where the pairwise votes do not single out one class.]

Multiclass classification. Approach 2

[Figure: the decision regions obtained by combining the pairwise classifiers.]


Multiclass classification with softmax

$p(y = i \mid \mathbf{x}) = \mu_i = \dfrac{\exp(\mathbf{w}_i^T \mathbf{x})}{\sum_j \exp(\mathbf{w}_j^T \mathbf{x})}$

• A solution to the problem of having an ambiguous region

[Figure: a network with inputs $1, x_1, \dots, x_d$, linear units $z_0, z_1, z_2$, and a softmax layer producing the outputs $\mu_0, \mu_1, \mu_2$.]
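A minimal sketch of the softmax predictive distribution; the weight matrix is a toy assumption, and subtracting the maximum score is a standard numerical-stability trick rather than something from the slide:

```python
import numpy as np

def softmax_probs(W, x):
    """p(y=i | x) = exp(w_i^T x) / sum_j exp(w_j^T x), for all classes at once."""
    z = W @ x                      # one linear score w_i^T x per class
    z = z - z.max()                # stability trick: does not change the ratios
    e = np.exp(z)
    return e / e.sum()

# Toy usage: 3 classes, input x with a leading 1 for the bias term.
W = np.array([[ 0.1,  1.0, -0.5],
              [ 0.0, -0.3,  0.8],
              [-0.2,  0.4,  0.1]])
x = np.array([1.0, 0.5, 1.5])      # [1, x_1, x_2]
p = softmax_probs(W, x)
print(p, p.argmax())               # probabilities sum to 1; argmax is the class
```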

Multiclass classification with softmax

[Figure: softmax decision regions for the three-class dataset; every point is assigned to exactly one class.]


Learning of the softmax model

• Learning of the parameters $\mathbf{w}$: statistical view

• Assume the outputs $y$ are transformed as follows: a class label $y \in \{0, 1, \dots, k-1\}$ is encoded as a one-of-$k$ (one-hot) vector with a single 1, e.g. class 0 becomes $(1, 0, \dots, 0)$ and class $k-1$ becomes $(0, \dots, 0, 1)$

• Classification then resembles a multi-way coin toss; the softmax network outputs model the class probabilities

$\mu_0 = P(y = 0 \mid \mathbf{x}), \;\dots,\; \mu_{k-1} = P(y = k-1 \mid \mathbf{x})$


Learning of the softmax model

• Learning of the parameters $\mathbf{w}$: statistical view

• Likelihood of outputs:

$L(D, \mathbf{w}) = p(\mathbf{Y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w})$

• We want the parameters $\mathbf{w}$ that maximize the likelihood

• Log-likelihood trick: optimize the log-likelihood of outputs instead:

$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \sum_{i=1}^{n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$

• Objective to optimize (the negative log-likelihood):

$J(D, \mathbf{w}) = -\sum_{i=1}^{n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$

where $\mu_{i,q} = p(y_{i,q} = 1 \mid \mathbf{x}_i, \mathbf{w})$


Learning of the softmax model

• Error to optimize:

$J(D, \mathbf{w}) = -\sum_{i=1}^{n} \sum_{q=0}^{k-1} y_{i,q} \log \mu_{i,q}$

• Gradient:

$\frac{\partial}{\partial w_{q,j}} J(D, \mathbf{w}) = -\sum_{i=1}^{n} x_{i,j}\,(y_{i,q} - \mu_{i,q})$

• Gradient update (with learning rate $\alpha$):

$\mathbf{w}_q \leftarrow \mathbf{w}_q + \alpha \sum_{i=1}^{n} (y_{i,q} - \mu_{i,q})\, \mathbf{x}_i$

• The same very easy gradient update as used for binary logistic regression

• But now we have to update the weights of $k$ networks
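A minimal NumPy sketch of this batch gradient update; the learning rate and iteration count are arbitrary assumptions:

```python
import numpy as np

def train_softmax(X, y, k, alpha=0.1, iters=500):
    """Batch gradient ascent on the log-likelihood of the softmax model.

    X: (n, d) inputs with a leading 1-column for the bias; y: (n,) labels in 0..k-1.
    """
    n, d = X.shape
    W = np.zeros((k, d))                      # one weight vector w_q per class
    Y = np.eye(k)[y]                          # one-hot targets y_{i,q}
    for _ in range(iters):
        Z = X @ W.T                           # scores w_q^T x_i
        Z -= Z.max(axis=1, keepdims=True)     # numerical stability
        E = np.exp(Z)
        Mu = E / E.sum(axis=1, keepdims=True) # mu_{i,q} = p(y_q = 1 | x_i, w)
        W += alpha * (Y - Mu).T @ X           # w_q += alpha * sum_i (y_iq - mu_iq) x_i
    return W
```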


Multi-way classification

• Yet another approach (Approach 3): organize the pairwise classifiers into a decision DAG

[Figure: a decision DAG for classes 1, 2, 3. The root tests 2 vs 1; each outcome eliminates one class ("not 1", "not 2", "not 3") and leads to another pairwise test (2 vs 3 or 3 vs 1), until a single class (1, 2, or 3) remains at a leaf.]


Decision trees

• An alternative approach to classification:

– Partition the input space into regions

– Regress or classify independently in every region

[Figure: an axis-aligned partition of the $(x_1, x_2)$ input space into rectangular regions, each containing mostly 0s or mostly 1s.]



Decision trees

• Decision tree model:

– Split the space recursively according to the inputs in $\mathbf{x}$

– Classify at the bottom of the tree

Example: binary classification $\{0, 1\}$, binary attributes $x_1, x_2, x_3$

[Figure: a decision tree. The root tests $x_3 = 0$; its t and f branches lead to further tests on $x_1 = 0$ and $x_2 = 0$, and the leaves carry the class labels 0 and 1.]

Decision trees

• Decision tree model:

– Split the space recursively according to the inputs in $\mathbf{x}$

– Classify at the bottom of the tree

Example: classify the input $\mathbf{x} = (x_1, x_2, x_3) = (1, 0, 0)$

[Figure: the same tree, with the path taken by $\mathbf{x} = (1, 0, 0)$ traced from the root test $x_3 = 0$ down to a leaf.]

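A minimal sketch of classification with such a tree; the slide's exact leaf labels do not survive extraction, so the tree below is an assumed stand-in with the same root test on $x_3$:

```python
# Each internal node: (attribute_index, true_subtree, false_subtree);
# each leaf: a class label. Tests have the form "x[attr] == 0".
# NOTE: an assumed tree for illustration, with the root test x3 == 0.
tree = (2,                      # root: x3 == 0 ?
        (0, 1, 0),              # true branch:  x1 == 0 ? -> class 1 : class 0
        (1, 0, 1))              # false branch: x2 == 0 ? -> class 0 : class 1

def classify(node, x):
    """Walk the tree: follow the t/f branch of each test until a leaf."""
    if not isinstance(node, tuple):
        return node                         # leaf: return the class label
    attr, t_branch, f_branch = node
    return classify(t_branch if x[attr] == 0 else f_branch, x)

print(classify(tree, (1, 0, 0)))            # x3 == 0, then x1 == 0 fails -> 0
```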


Learning decision trees

How to construct/learn the decision tree?

• Top-down algorithm:

– Find the best split condition (quantified based on the impurity measure)

– Stop when no improvement is possible

• Impurity measure $I$:

– measures how well the two classes in the training data $D$ are separated: $I(D)$

– ideally we would like to separate all 0s and 1s

• Splits: finite- or continuous-valued attributes

– conditions for continuous-valued attributes are thresholds, e.g. $x_3 \le 0.5$


Impurity measure

Let $p_i = \frac{|D_i|}{|D|}$ be the ratio of instances classified as $i$, where

• $|D|$ = total number of data entries in the training dataset

• $|D_i|$ = number of data entries classified as $i$

• $\sum_{i=1}^{\text{number of classes}} p_i = 1$

Impurity measure $I(D)$:

• defines how well the classes are separated

• in general, the impurity measure should satisfy:

– it is largest when the data are split evenly among the classes

– it should be 0 when all data belong to the same class


Impurity measures

• There are various impurity measures used in the literature

– Entropy-based measure (Quinlan, C4.5):

$I(D) = \mathrm{Entropy}(D) = -\sum_{i=1}^{k} p_i \log p_i$

– Gini measure (Breiman, CART):

$I(D) = \mathrm{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^2$

[Figure: entropy and Gini impurity plotted against $p_1$ for $k = 2$; both peak at $p_1 = 0.5$ and vanish at 0 and 1.]
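A minimal sketch of the two impurity measures, computed from a vector of per-class counts:

```python
import numpy as np

def entropy(counts):
    """Entropy-based impurity: -sum_i p_i log p_i (with 0 log 0 taken as 0)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                     # drop empty classes to avoid log(0)
    return -np.sum(p * np.log(p))

def gini(counts):
    """Gini impurity: 1 - sum_i p_i^2."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(entropy([5, 5]), gini([5, 5]))    # maximal for an even split
print(entropy([10, 0]), gini([10, 0]))  # 0 when one class owns all the data
```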


Impurity measures

• Gain due to a split: the expected reduction in the impurity measure (entropy example)

$\mathrm{Gain}(D, A) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|}\, \mathrm{Entropy}(D_v)$

where $D_v$ is the partition of $D$ with the value of attribute $A = v$.

[Figure: candidate splits of the same dataset $D$, on $x_3 = 0$ and then on $x_1 = 0$, annotated with $\mathrm{Entropy}(D)$ before the split and $\mathrm{Entropy}(D_t)$, $\mathrm{Entropy}(D_f)$ after.]
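A minimal sketch of the gain computation for one candidate attribute, reusing the entropy helper from the previous sketch; the data encoding (a 2-D array of discrete attribute values plus integer class labels) is an assumption:

```python
import numpy as np

def information_gain(X, y, attr):
    """Expected entropy reduction from splitting on column `attr` of X."""
    total = entropy(np.bincount(y))
    n = len(y)
    remainder = 0.0
    for v in np.unique(X[:, attr]):
        mask = X[:, attr] == v
        remainder += (mask.sum() / n) * entropy(np.bincount(y[mask]))
    return total - remainder
```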


Decision tree learning

• Greedy learning algorithm (a minimal sketch follows below):

Repeat until no (or only a small) improvement in the purity:

– Find the attribute with the highest gain

– Add the attribute to the tree and split the set accordingly

• Builds the tree in a top-down fashion

– Gradually expands the leaves of the partially built tree

• The method is greedy

– It looks at a single attribute and its gain in each step

– May fail when a combination of attributes is needed to improve the purity (e.g. parity functions)
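A compact recursive version of this greedy loop, built on the information_gain sketch above; the minimum-gain stopping threshold is an arbitrary assumption:

```python
import numpy as np
from collections import Counter

def grow_tree(X, y, attrs, min_gain=1e-6):
    """Greedy top-down tree construction: split on the highest-gain attribute."""
    majority = Counter(y).most_common(1)[0][0]
    if len(set(y)) == 1 or not attrs:
        return majority                      # pure node or nothing left to split on
    gains = {a: information_gain(X, y, a) for a in attrs}
    best = max(gains, key=gains.get)
    if gains[best] <= min_gain:
        return majority                      # no useful split: stop and classify
    children = {}
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        children[v] = grow_tree(X[mask], y[mask],
                                [a for a in attrs if a != best], min_gain)
    return (best, children)                  # internal node: (attribute, branches)
```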


Decision tree learning

• Limitations of greedy methods

Cases in which a combination of two or more attributes improves the impurity:

[Figure: an XOR-like (parity) layout of 0s and 1s in the plane; no single axis-aligned split improves the impurity, but a combination of two splits separates the classes.]


Decision tree learning

By reducing the impurity measure we can grow very large trees.

Problem: overfitting

• We may split and classify the training set very well, but do worse in terms of the generalization error

Solutions to the overfitting problem:

• Solution 1:

– Prune branches of the tree built in the first phase

– Use a validation set to test for overfitting

• Solution 2:

– Test for overfitting during the tree-building phase

– Stop building the tree when the performance on the validation set deteriorates