
Page 1: Introduction to Machine Learning Aristotelis Tsirigos

Introduction to Machine Learning

Aristotelis Tsirigos
email: [email protected]

Dennis Shasha - Advanced Database Systems
NYU Computer Science

Page 2: Introduction to Machine Learning Aristotelis Tsirigos

2

What is Machine Learning?

• Principles, methods, and algorithms for making predictions based on past experience

• Not mere memorization, but the capability to generalize to novel situations based on past experience

• Where no theoretical model exists to explain the data, machine learning can be employed to offer such a model

Page 3: Introduction to Machine Learning Aristotelis Tsirigos

3

Learning models

According to how active or passive the learner is:

• Statistical learning model
– No control over the observations; they are presented at random in an independent, identically distributed (i.i.d.) fashion

• Online model
– An external source presents the observations to the learner in a query form

• Query model
– The learner queries an external "expert" source

Page 4: Introduction to Machine Learning Aristotelis Tsirigos

4

Types of learning problems

A very rough categorization of learning problems:

• Unsupervised learning
– Clustering, density estimation, feature selection

• Supervised learning
– Classification, regression

• Reinforcement learning
– Feedback, games

Page 5: Introduction to Machine Learning Aristotelis Tsirigos

5

Outline

• Learning methods
– Bayesian learning
– Nearest neighbor
– Decision trees
– Linear classifiers
– Ensemble methods (bagging & boosting)

• Testing the learner

• Learner evaluation

• Practical issues

• Resources

Page 6: Introduction to Machine Learning Aristotelis Tsirigos

6

Bayesian learning - Introduction

• Given are:
– Observed data D = { d1, d2, …, dn }

– Hypothesis space H

• In the Bayesian setup we want to find the hypothesis that best fits the data in a probabilistic manner:

• In general, it is computationally intractable without any assumptions about the data

$$h_{\max} = \arg\max_{h \in H} P(h \mid D)$$

Page 7: Introduction to Machine Learning Aristotelis Tsirigos

7

Bayesian learning - Elaboration

• First transformation using Bayes rule:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

• Now we get:

$$h_{\max} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

• Notice that the optimal choice also depends on the a priori probability P(h) of hypothesis h

Page 8: Introduction to Machine Learning Aristotelis Tsirigos

8

Bayesian learning - Independence

• We can further simplify by assuming that the data in D are independently drawn from the underlying distribution:

$$P(D \mid h) = \prod_{i=1}^{n} P(d_i \mid h)$$

• Under this assumption:

$$h_{\max} = \arg\max_{h \in H} P(h) \prod_{i=1}^{n} P(d_i \mid h)$$

• How do we estimate P(d_i | h) in practice?
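As a toy illustration of the MAP rule above (not part of the original slides), the sketch below scores a handful of candidate coin-bias hypotheses on a sequence of flips, combining the prior P(h) with the i.i.d. likelihood in log space; the hypothesis space and priors are made up.

```python
import math

# Illustrative hypothesis space: each hypothesis h is a candidate coin bias
# P(heads), with a made-up prior P(h).
hypotheses = {0.3: 0.2, 0.5: 0.6, 0.8: 0.2}   # bias -> prior P(h)
D = [1, 1, 0, 1, 1, 0, 1, 1]                  # observed flips d_i (1 = heads)

def log_posterior_unnorm(bias, prior, data):
    # log P(h) + sum_i log P(d_i | h), using the independence assumption
    log_likelihood = sum(math.log(bias if d == 1 else 1.0 - bias) for d in data)
    return math.log(prior) + log_likelihood

h_max = max(hypotheses, key=lambda b: log_posterior_unnorm(b, hypotheses[b], D))
print("MAP hypothesis (coin bias):", h_max)
```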

Page 9: Introduction to Machine Learning Aristotelis Tsirigos

9

Bayesian learning - Analysis

• For any hypothesis h in H we assume a distribution:
– P(d | h) for any point d in the input space

• If d is a point in an m-dimensional space, then the distribution is in fact:
– P(d^(1), d^(2), …, d^(m) | h)

• Problems:
– Complex optimization problem; suboptimal techniques can be used if the distributions are differentiable
– In general, there are too many parameters to estimate, therefore a lot of data is needed for reliable estimation

Page 10: Introduction to Machine Learning Aristotelis Tsirigos

10

Bayesian learning - Analysis

• Need to further analyze the distribution P(d | h):
– We can assume the features are independent (Naïve Bayes; see the sketch below):

$$P(d \mid h) = \prod_{k=1}^{m} P(d^{(k)} \mid h)$$

– Or, build Bayesian Networks where dependencies among the features are explicitly modeled

• Still we have to somehow learn the distributions
– Model them as parametrized distributions (e.g. Gaussians)
– Estimate the parameters using standard greedy techniques (e.g. Expectation Maximization)
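For the Naïve Bayes case, the per-feature distributions P(d^(k) | h) can be estimated simply by counting. Below is a minimal sketch on a few categorical rows in the spirit of the election example later in the slides; the data subset, the Laplace smoothing term, and all names are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Illustrative categorical rows: (Economy, Popularity, Gas prices) -> Elected.
train = [
    (("BAD",  "HIGH", "LOW"),  "YES"),
    (("BAD",  "HIGH", "LOW"),  "NO"),
    (("BAD",  "LOW",  "HIGH"), "NO"),
    (("GOOD", "LOW",  "LOW"),  "YES"),
    (("GOOD", "HIGH", "LOW"),  "YES"),
    (("GOOD", "HIGH", "HIGH"), "YES"),
]

labels = Counter(y for _, y in train)                      # counts for P(h)
feat_counts = defaultdict(Counter)                         # counts for P(d^(k) | h)
for x, y in train:
    for k, v in enumerate(x):
        feat_counts[(k, y)][v] += 1
values = [set(x[k] for x, _ in train) for k in range(3)]   # value set per feature

def predict(x, alpha=1.0):
    # argmax_h  P(h) * prod_k P(x^(k) | h), with Laplace smoothing alpha (assumption)
    def score(h):
        s = labels[h] / len(train)
        for k, v in enumerate(x):
            s *= (feat_counts[(k, h)][v] + alpha) / (labels[h] + alpha * len(values[k]))
        return s
    return max(labels, key=score)

print(predict(("GOOD", "HIGH", "LOW")))   # -> "YES" on this toy data
```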

Page 11: Introduction to Machine Learning Aristotelis Tsirigos

11

Bayesian learning - Summary

• Makes use of prior knowledge of:
– The likelihood of alternative hypotheses, and
– The probability of observing the data given a specific hypothesis

• The goal is to determine the most probable hypothesis given a series of observations

• The Naïve Bayes method has been found useful in practical applications (e.g. text classification)

• If the naïve assumption is not appropriate, there is a generic algorithm (EM) that can be used to find a hypothesis that is locally optimal

Page 12: Introduction to Machine Learning Aristotelis Tsirigos

12

Nearest Neighbor - Introduction

• Belongs to the class of instance-based learners:
– The learner does not make any global prediction of the target function; it only predicts locally for a given point (lazy learner)

• Idea:
– Given a query instance x, look at past observations D that are "close" to x in order to determine x's class y

• Issues:
– How do we define distance?
– How do we define the notion of "neighborhood"?

Page 13: Introduction to Machine Learning Aristotelis Tsirigos

13

Nearest Neighbor - Details

• Classify a new instance x according to its neighborhood N(x)
• The neighborhood can be defined in different ways:
– Constant radius
– k-Neighbors

• Weights are a function of distance: w_i = w(d(x, x_i))

Classification rule:

$$f(x) = \operatorname{sign}\Big(\sum_{i:\, x_i \in N(x)} w_i\, y_i\Big)$$

where y_i is the label for point x_i and w_i is the weight for x_i.

[Figure: a query point x and its neighborhood N(x).]
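A minimal distance-weighted nearest-neighbor sketch of the classification rule above, assuming Euclidean distance, a k-nearest neighborhood, and inverse-distance weights (all of these are choices, not prescribed by the slides):

```python
import math

def knn_predict(x, train, k=3):
    """Distance-weighted k-NN for labels in {-1, +1}; train is a list of (point, label)."""
    dist = lambda a, b: math.dist(a, b)                          # Euclidean distance (a choice)
    neighbors = sorted(train, key=lambda p: dist(x, p[0]))[:k]   # neighborhood N(x)
    # weights w_i = w(d(x, x_i)); inverse-distance weighting used here
    vote = sum(y / (dist(x, xi) + 1e-9) for xi, y in neighbors)
    return 1 if vote >= 0 else -1

train = [((0.0, 0.0), -1), ((0.1, 0.2), -1), ((1.0, 1.0), +1), ((0.9, 1.1), +1)]
print(knn_predict((0.8, 0.9), train))   # -> +1, the nearby positive cluster wins
```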

Page 14: Introduction to Machine Learning Aristotelis Tsirigos

14

Nearest Neighbor - Summary

• Classify new instances according to their closest points

• Control accuracy in three ways:
– Distance metric
– Definition of neighborhood
– Weight assignment

• These parameters must be tuned depending on the problem:
– Is there noise in the data?
– Outliers?
– What is a "natural" distance for the data?

Page 15: Introduction to Machine Learning Aristotelis Tsirigos

15

Decision Trees - Introduction

• Suppose the data is categorical

• Observation:
– Distance cannot be defined in a natural way
– We need a learner that operates directly on the attribute values

Economy  Popularity  Gas prices  War casualties  Elected
BAD      HIGH        LOW         NO              YES
BAD      HIGH        LOW         YES             NO
BAD      LOW         HIGH        NO              NO
BAD      LOW         LOW         NO              NO
GOOD     LOW         LOW         YES             YES
GOOD     LOW         HIGH        NO              NO
GOOD     HIGH        LOW         NO              YES
GOOD     HIGH        LOW         YES             YES
GOOD     HIGH        HIGH        NO              YES
GOOD     HIGH        HIGH        YES             YES

Page 16: Introduction to Machine Learning Aristotelis Tsirigos

16

Decision Trees - The model

• Idea: a decision tree that "explains" the data

• Observation:
– In general, there is no unique tree to represent the data
– In some nodes the decision is not strongly supported by the data

[Figure: a decision tree for the election data. The root splits on Economy; lower nodes split on Popularity, Gas prices, and War casualties; each leaf predicts Elected = YES or NO, annotated with its supporting counts (e.g. YES=4/NO=0 for a well-supported leaf, YES=0/NO=1 for a weakly supported one).]

Page 17: Introduction to Machine Learning Aristotelis Tsirigos

17

Decision Trees - Training

• Build the tree from top to bottom choosing one attribute at a time

• How do we make the choice? Idea:
– Choose the most "informative" attribute first
– Having no other information, which attribute allows us to classify correctly most of the time?

• This can be quantified using the Information Gain metric:
– Based on entropy, a measure of randomness
– It measures the reduction in uncertainty about the target value given the value of one of the attributes, and therefore tells us how informative that attribute is (see the sketch below)
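A small sketch of the information-gain computation, run on two columns of the election table from slide 15 (the helper names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum_v p(v) * log2 p(v)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    # IG(Y; A) = H(Y) - sum_a (n_a / n) * H(Y | A = a)
    n = len(rows)
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

# Economy / Popularity / Elected columns of the election table.
rows = [
    {"Economy": "BAD",  "Popularity": "HIGH", "Elected": "YES"},
    {"Economy": "BAD",  "Popularity": "HIGH", "Elected": "NO"},
    {"Economy": "BAD",  "Popularity": "LOW",  "Elected": "NO"},
    {"Economy": "BAD",  "Popularity": "LOW",  "Elected": "NO"},
    {"Economy": "GOOD", "Popularity": "LOW",  "Elected": "YES"},
    {"Economy": "GOOD", "Popularity": "LOW",  "Elected": "NO"},
    {"Economy": "GOOD", "Popularity": "HIGH", "Elected": "YES"},
    {"Economy": "GOOD", "Popularity": "HIGH", "Elected": "YES"},
    {"Economy": "GOOD", "Popularity": "HIGH", "Elected": "YES"},
    {"Economy": "GOOD", "Popularity": "HIGH", "Elected": "YES"},
]
for a in ("Economy", "Popularity"):
    print(a, round(information_gain(rows, a, "Elected"), 3))
```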

Page 18: Introduction to Machine Learning Aristotelis Tsirigos

18

Decision Trees - Overfitting

• Problems:
– The solution is greedy, therefore suboptimal
– The optimal solution is infeasible due to time constraints
– Is "optimal" really optimal? What if the observations are corrupted by noise?

• We are really interested in the true error, not the training error:
– Overfitting occurs in the presence of noise
– Occam's razor: prefer simpler solutions
– Apply pruning to eliminate nodes with low statistical support

Page 19: Introduction to Machine Learning Aristotelis Tsirigos

19

Decision Trees - Pruning

• Get rid of nodes with low support

[Figure: the election-data decision tree again, with the weakly supported nodes (leaves backed by only one example) pruned away.]

• The pruned tree does not fully explain the data, but we hope that it will generalize better on unseen instances…

Page 20: Introduction to Machine Learning Aristotelis Tsirigos

20

Decision Trees - Summary

• Advantages:
– Categorical data
– Easy to interpret in a simple rule format

• Disadvantages:
– Hard to accommodate numerical data
– Suboptimal solution
– Bias towards simple trees

Page 21: Introduction to Machine Learning Aristotelis Tsirigos

21

Linear Classifiers - Introduction

Decision function: f(x) = ⟨w · x⟩ + b

Predicted label: y = sign(f(x)) (sketched below)

• There is an infinite number of hyperplanes f(x)=0 that can separate positive from negative examples!

• Is there an optimal one to choose?
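A two-line sketch of the decision function f(x) = ⟨w · x⟩ + b and the predicted label sign(f(x)); the weight vector and bias here are arbitrary illustrative values, not learned from data:

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5        # illustrative hyperplane parameters

def predict(x):
    f = np.dot(w, x) + b                 # decision function f(x)
    return np.sign(f)                    # predicted label

print(predict(np.array([1.0, 0.5])))     # f = 2.0 - 0.5 + 0.5 = 2.0 -> +1.0
```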

Page 22: Introduction to Machine Learning Aristotelis Tsirigos

22

Linear Classifiers - Margins

• Make sure the hyperplane leaves enough room for future points!

• Margins:
– For a training point x_i define its margin:

$$\gamma_i = \frac{y_i\, f(x_i)}{\lVert w \rVert}$$

– For a classifier f it is the worst of all margins:

$$M(f) = \min_i \{\, \gamma_i \,\}$$

[Figure: a separating hyperplane f(x) = 0 with margins to the nearest points on either side.]

Page 23: Introduction to Machine Learning Aristotelis Tsirigos

23

Linear Classifiers - Optimization

• Now, the only thing we have to do is find the f with the maximum possible margin!
– This is a quadratic optimization problem:

$$\min_{w,b}\; \lVert w \rVert^2 = \langle w \cdot w \rangle \quad \text{subject to} \quad y_i\, f(x_i, w, b) \ge 1, \quad i = 1, 2, \ldots, n$$

• This classifier is known as the Support Vector Machine (SVM)

– The optimal w* yields the maximum margin γ*:

$$\gamma^* = \lVert w^* \rVert^{-1}$$

– Finally, the optimal hyperplane can be written as:

$$f(x, w^*, b^*) = \sum_{i \in SV} \alpha_i^*\, y_i \langle x_i \cdot x \rangle + b^*$$
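In practice one rarely solves this quadratic program by hand; the sketch below trains a linear SVM with scikit-learn on synthetic data and reads off the margin γ* = 1/||w*||. scikit-learn and the synthetic data are assumptions of this example, not part of the original slides (which point to SVMlight instead).

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated synthetic clusters, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e3)        # large C approximates the hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("maximum margin gamma* = 1/||w*|| =", 1.0 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_))
```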

Page 24: Introduction to Machine Learning Aristotelis Tsirigos

24

Linear Classifiers - Problems

• What if the data is noisy or, worse, linearly inseparable?

• Solution 1:
– Allow for outliers in the data when the data is noisy

• Solution 2:
– Increase dimensionality by creating composite features if the target function is nonlinear

• Solution 3:
– Do both 1 and 2

Page 25: Introduction to Machine Learning Aristotelis Tsirigos

25

Linear Classifiers - Outliers

• Impose softer restrictions on the margin distribution to accept outliers

[Figure: a separating hyperplane f(x) = 0 whose softer margin tolerates an outlier on the wrong side.]

• Now our classifier is more flexible and more powerful, but there are more parameters to estimate

Page 26: Introduction to Machine Learning Aristotelis Tsirigos

26

Linear Classifiers - Nonlinearity (!)

• Combine input features to form more complex ones:
– Initial space: x = (x_1, x_2, …, x_m)
– Induced space: Φ(x) = (x_1², x_2², …, x_m², √2·x_1x_2, √2·x_1x_3, …, √2·x_{m-1}x_m)
– The inner product can now be written as ⟨Φ(x)·Φ(y)⟩ = ⟨x·y⟩² (checked numerically below)

• Kernels:
– The above product is denoted K(x, y) = ⟨Φ(x)·Φ(y)⟩ and is called a kernel
– Kernels induce nonlinear feature spaces based on the initial feature space
– There is a huge collection of kernels for vectors, trees, strings, graphs, time series, …

• Linear separation in the composite feature space implies a nonlinear separation in the initial space!
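A quick numerical check of the kernel identity ⟨Φ(x)·Φ(y)⟩ = ⟨x·y⟩², assuming the squared/√2-cross-term form of Φ written above; both sides should print the same number.

```python
import itertools
import numpy as np

def phi(x):
    # Explicit feature map: squared terms plus sqrt(2) * cross terms.
    squares = [xi * xi for xi in x]
    cross = [np.sqrt(2) * x[i] * x[j]
             for i, j in itertools.combinations(range(len(x)), 2)]
    return np.array(squares + cross)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

print(np.dot(phi(x), phi(y)))   # inner product in the induced space
print(np.dot(x, y) ** 2)        # kernel trick: same value, no explicit mapping
```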

Page 27: Introduction to Machine Learning Aristotelis Tsirigos

27

Linear Classifiers - Summary

• Provides a generic solution to the learning problem

• We just have to solve an easy optimization problem

• Parametrized by the induced feature space and the noise parameters

• There exist theoretical bounds on their performance

Page 28: Introduction to Machine Learning Aristotelis Tsirigos

28

Ensembles - Introduction

• Motivation:
– finding just one classifier is "too risky"

• Idea:
– combine a group of classifiers into the final learner

• Intuition:
– each classifier is associated with some risk of wrong predictions on future data
– instead of investing in just one risky classifier, we can distribute the decision across many classifiers, effectively reducing the overall risk

Page 29: Introduction to Machine Learning Aristotelis Tsirigos

29

Ensembles - Bagging

• Main idea:
– From the observations D, T subsets D_1, …, D_T are drawn at random
– For each D_i, train a "base" classifier f_i (e.g. a decision tree)
– Finally, combine the T classifiers into one classifier f by taking a majority vote (sketched below):

$$f(x) = \operatorname{sign}\Big(\sum_{i=1}^{T} f_i(x)\Big)$$

• Observations:
– We need enough observations to get partitions that approximately respect the i.i.d. condition ( |D| >> T )
– How do we decide on the base classifier?
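A minimal bagging sketch, using bootstrap resampling as one common way to draw the random subsets and a decision tree as the base classifier; these choices, scikit-learn, and the synthetic data are all assumptions of this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap subset D_i
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Majority vote: f(x) = sign( sum_i f_i(x) ), labels assumed in {-1, +1}
    return np.sign(sum(m.predict(X) for m in models))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (30, 2)), rng.normal(-1, 1, (30, 2))])
y = np.array([+1] * 30 + [-1] * 30)
models = bagging_fit(X, y)
print(bagging_predict(models, np.array([[1.5, 1.0], [-1.5, -1.0]])))
```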

Page 30: Introduction to Machine Learning Aristotelis Tsirigos

30

Ensembles - Boosting

• Main idea:
– Run through a number of iterations
– At each iteration t, a "weak" classifier f_t is trained on a weighted version of the training data (initially all weights are equal)
– Each point's weight is updated so that examples with a poor margin with respect to f_t are assigned a higher weight, in an attempt to "boost" them in the next iterations
– The classifier itself is assigned a weight α_t according to its training error

• Combine all classifiers into a weighted majority vote:

$$f(x) = \operatorname{sign}\Big(\sum_{t} \alpha_t\, f_t(x)\Big)$$
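An AdaBoost-style sketch of this scheme, with one-feature decision stumps as the weak learners and labels in {-1, +1}; the stump learner, the weight-update formula, and the toy data are standard AdaBoost choices assumed here for illustration.

```python
import numpy as np

def stump_fit(X, y, w):
    # Pick the single-feature threshold stump with the lowest weighted error.
    best = (0, 0.0, 1, np.inf)                       # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = pol * np.where(X[:, j] >= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                          # initially equal weights
    ensemble = []
    for _ in range(T):
        j, thr, pol, err = stump_fit(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)        # classifier weight alpha_t
        pred = pol * np.where(X[:, j] >= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)               # boost poorly classified points
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    # Weighted majority vote: f(x) = sign( sum_t alpha_t * f_t(x) )
    score = sum(a * p * np.where(X[:, j] >= t, 1, -1) for a, j, t, p in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])
print(predict(adaboost(X, y), X))                    # recovers the training labels
```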

Page 31: Introduction to Machine Learning Aristotelis Tsirigos

31

Bagging vs. Boosting

• Two distinct ways to apply the diversification idea

               Bagging                      Boosting
Training data  Partition before training   Adaptive data weighting
Base learner   Complex                     Simple
Effect         Risk minimization           Margin maximization

Page 32: Introduction to Machine Learning Aristotelis Tsirigos

32

Testing the learner

• How do we estimate the learner’s performance?

• Create test sets from the original observations:

– Test set:
• Partition the data into training and test sets, and use the error on the test set as an estimate of the true error

– Leave-one-out:
• Remove one point, train on the rest, then report the error on that point
• Do this for all points and report the mean error

– k-fold Cross Validation (sketched below):
• Randomly partition the data set into k non-overlapping sets
• Choose one set at a time for testing and train on the rest
• Report the mean error
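A small k-fold cross-validation sketch; the decision-tree learner, scikit-learn, and the synthetic data are stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def k_fold_error(X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                   # k non-overlapping sets
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = DecisionTreeClassifier().fit(X[train], y[train])
        errors.append(np.mean(model.predict(X[test]) != y[test]))
    return np.mean(errors)                           # mean error over the k folds

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)              # a simple synthetic target
print("estimated true error:", k_fold_error(X, y))
```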

Page 33: Introduction to Machine Learning Aristotelis Tsirigos

33

Learner evaluation - PAC learning

• Probably Approximately Correct (PAC) learning of a target class C using hypothesis space H:
– For all target functions f in C and for all 0 < ε, δ < 1/2, with probability at least (1 − δ) we can learn a hypothesis h in H that approximates f with error at most ε

• Generalization (or true) error:
– We really care about the error on unseen data

• Statistical learning theory gives us the tools to express the true error (and its confidence) in terms of:
– The empirical (or training) error
– The confidence 1 − δ of the true error
– The number of training examples
– The complexity of the classes C and/or H

Page 34: Introduction to Machine Learning Aristotelis Tsirigos

34

Learner evaluation - VC dimension

• The VC dimension of a hypothesis space H measures its power to interpret the observations

• Infinite hypothesis space size does not necessarily imply infinite VC dimension!

• Bad news:
– If we allow a hypothesis space with infinite VC dimension, learning is impossible (it requires an infinite number of observations)

• Good news:
– For the class of large-margin linear classifiers the following error bound can be proven, where n is the number of training examples, γ is the margin, R is the radius of a ball containing the data, and 1 − δ is the confidence:

$$\mathrm{error}(f) \;\le\; \frac{2}{n}\left(\frac{64R^2}{\gamma^2}\,\log\frac{e\,n\,\gamma}{8R^2}\,\log\frac{32n}{\gamma^2} + \log\frac{4}{\delta}\right)$$
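To get a feel for how the bound behaves, the snippet below plugs in made-up values for R, γ, and δ and varies n; the numbers are purely illustrative (for small n the bound is vacuous, i.e. larger than 1).

```python
import math

def margin_bound(n, R, gamma, delta):
    # Direct evaluation of the bound above (illustrative values only).
    term = (64 * R**2 / gamma**2) \
           * math.log(math.e * n * gamma / (8 * R**2)) \
           * math.log(32 * n / gamma**2)
    return (2.0 / n) * (term + math.log(4.0 / delta))

for n in (10_000, 100_000, 1_000_000):
    print(n, round(margin_bound(n, R=1.0, gamma=0.5, delta=0.05), 3))
```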

Page 35: Introduction to Machine Learning Aristotelis Tsirigos

35

Practical issues

• Machine learning is driven by data, so a good learner must be data-dependent in all aspects:
– Hypothesis space
– Prior knowledge
– Feature selection and composition
– Distance/similarity measures
– Outliers/noise

• Never forget that learners and training algorithms must be efficient in time and space with respect to:
– The feature space dimensionality
– The training set size
– The hypothesis space size

Page 36: Introduction to Machine Learning Aristotelis Tsirigos

36

Conclusions

• Machine learning is mostly art and a little bit of science!

• For each problem at hand a different classifier will be the optimal one

• This simply means that the solution must be data-dependent:
– Select an "appropriate" family of classifiers (e.g. Decision Trees)
– Choose the right representation for the data in the feature space
– Tune the available parameters of your favorite classifier to reflect the "nature" of the data

• Many practical applications, especially when there is no good theory available to model the data

Page 37: Introduction to Machine Learning Aristotelis Tsirigos

37

Resources

• Books
– T. Mitchell, Machine Learning
– N. Cristianini & J. Shawe-Taylor, An Introduction to Support Vector Machines
– V. Kecman, Learning and Soft Computing
– R. Duda, P. Hart & D. Stork, Pattern Classification

• Online tutorials
– A. Moore, http://www-2.cs.cmu.edu/~awm/tutorials/

• Software
– WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
– SVMlight: http://svmlight.joachims.org/